Preface


These are the proceedings of the 25th annual event of IDEAS: it marks a quarter century of
organizing and holding these meetings in the series. This meeting had the further challenge of
dealing, for the second year, with the coronavirus pandemic. Having endured the global lockdown
in 2020, we were looking forward to getting together for this silver anniversary event in
cosmopolitan Montreal. Unfortunately, the second and third waves of Covid called for a
change in the mode of the meeting late in the call for papers, giving us a shorter time for the final
sprint! It is ironic that IDEAS, in its span of 25 years, had to contend first with SARS and then, two
years in a row, with Covid. SARS was contained and we were able to hold an in-person meeting
in Hong Kong; however, this time we had to switch, for the second year in a row, during the last
weeks of the CFP due to the unending waves of the pandemic. It is heartening to learn that, in
spite of these challenges, we received 63 submissions plus 3 invited papers. This allowed us to
continue to be selective! This meeting highlights the current preoccupation with AI, big data,
blockchain, data analytics, machine learning, the issues of a pandemic and the tyranny of the
web; this is reflected in the accepted papers in these proceedings.


We would like to take this opportunity to thank the members of our program committee, listed
here, for their help in the review process. All the submitted papers were assigned to four
reviewers and, due to the shorter review period, we got back over 3.2 reviews per paper on average.
The proceedings consist of 27 full papers (acceptance rate 43%) and 6 short papers (9%).


Acknowledgment: This conference would not have been possible without the help and effort of 
many people and organizations. Thanks are owed to: 


- ACM (Anna Lacson, Craig Rodkin, and Barbara Ryan), 

- BytePress, ConfSys.org, Concordia University (Will Knight and Gerry Laval), 

- Many other people and support staff, who contributed selflessly to organizing and
holding this event.


We appreciate their efforts and dedication.


Bipin C. Desai, Concordia University, Canada
Richard McClatchey, Univ. of West England, U.K.
Motomichi Toyoma, Keio University, Japan
Jeffrey Ullman, Stanford University, USA


Montreal, July, 2021 
ABSTRACT


In this paper we investigate within-record referential constraints
on tree-structured data. We consider an SQL-like query language
such as the one used in Dremel, and we call it tree-SQL. We show
how to define and process a query in tree-SQL in the presence
of referential constraints. We give the semantics of tree-SQL via
flattening and show how to produce equivalent semantics using the
notion of tree-expansion of a query in the presence of referential
constraints.


CCS CONCEPTS 


- Information systems → Semi-structured data; Query optimization;
Database query processing;


KEYWORDS 
query processing, tree-structured data, referential constraints 


ACM Reference Format: 

Foto N. Afrati, Matthew Damigos, and Nikos Stasinopoulos. 2021. SQL-like 
query language and referential constraints on tree-structured data. In 25th 
International Database Engineering & Applications Symposium (IDEAS 2021), 
July 14-16, 2021, Montreal, QC, Canada. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3472163.3472184 


1 INTRODUCTION 


Analysis of large collections of complex data (e.g., tree-like, graph)
has been made efficient thanks to the emergence of document-oriented
databases (e.g., ElasticSearch, MongoDB) and data systems
that combine a tree-structured data model and columnar storage,
such as F1 [27], Dremel/BigQuery [21] and Apache Parquet. Relational
databases have caught up by adding support for hierarchical
data types (e.g., the JSONB type in PostgreSQL and struct in Hive).

In this paper, we draw inspiration from the Dremel data model [1,
3, 21, 22] and use the theoretical tree-record data model introduced
in [2] for representing collections of tree-structured records. The
tree-record data model supports identity and referential constraints
which can be defined within each tree-record (called within-record
constraints). Identity and referential constraints have been studied


Permission to make digital or hard copies of all or part of this work for personal or 
classroom use is granted without fee provided that copies are not made or distributed 
for profit or commercial advantage and that copies bear this notice and the full citation 
on the first page. Copyrights for components of this work owned by others than the 
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or 
republish, to post on servers or to redistribute to lists, requires prior specific permission 
and/or a fee. Request permissions from permissions@acm.org. 

IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. 
ACM ISBN 978-1-4503-8991-4/21/07. 
extensively for the relational model and, later on, in the context
of XML [5, 12, 15], graphs [14] and RDF data [7, 19]. Recently,
identity constraints have been analyzed for JSON data models [4,
24]. Unlike relational databases and XML, where constraints are
used for data validation, in this work they are exploited to enable
query answering by incorporating in the language a feature that
allows queries to call on such constraints.

We consider an SQL-like query language, similar to the one used
in Dremel, F1 and Apache Drill, to query collections of tree-structured
records, and use a flattening mechanism to transform the tree-structured
records into relational records (i.e., unnest the nested and repeated
fields in each record). Unlike traditional flattening [3], which
uses all the paths in the (tree) schema to generate the relational
records, we use relative flattening [2] to flatten the tree-records
with respect to the paths included in the given query. Furthermore, we
extend relative flattening in order not only to take into account
the within-record constraints during query answering, but also to
arbitrarily navigate through multiple references (even in the case
that they form cycles in the schema).

Although the semantics defined via full flattening [2, 3] offer a
natural way to interpret SQL-like queries, evaluating a query this way is not
efficient: full flattening can expand the amount of space necessary to hold the
data. Thus it is preferable for the query to be viewed (if possible) as a
tree query [3]. Then we can compute the result by using embeddings
of the tree query into the tree that represents the data. This is what
is offered in this paper.


2 TREE-STRUCTURED RECORDS 


In this section we discuss the data model. 

We consider a table which consists of records, as in relational
databases, but the schema of each record is a tree instead of a list of
attributes. The attributes that receive values in an instance of such
a schema are the ones appearing in the leaves of the tree. We assign
attributes to the internal nodes of the tree as well, which depict the
structure of the record. All the attributes of the schema tree are
associated with a group type which describes the structure of the
subtree rooted at the specific attribute.¹ The formal definition is
as follows:

A group type (or simply group) G is a complex data type defined 
by an ordered list of items called attributes or fields of unique names 
which are associated with a data type, either primitive type or group 
type. We associate an annotation with each field within a group 
type. This annotation denotes the repetition constraint. We describe 
details next. 


¹ In fact, a group type could be thought of as an element in XML, an object in JSON, or
as a Struct type in other data management systems (e.g., SparkSQL, Hive).
Some attributes/fields are allowed to receive more than one
value in any instance associated with this group type, and this is
denoted by an annotation on the field name. There are four
options for the repetition constraint of a field; they
specify the number of times a field is allowed to be repeated
within the group of its parent, and the chosen option annotates the name of the
field. Formally, the repetition constraint for a field N can take one
of the following values with the corresponding annotation:


• required: N is mandatory, and there is no annotation,
• optional: N is optional (i.e., appears 0 or 1 times) and is labeled by N?,
• repeated: N appears 0 or more times and is labeled by N*,
• required and repeated: N appears 1 or more times and is labeled by N+.


We denote by repTypes the set {required, repeated, optional, 
required and repeated} of repetition types. 
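The four repetition constraints can be sketched in code as follows. This is only an illustration under assumptions: the names `Rep`, `Field`, `label` and `allowed` are hypothetical and do not come from the paper or from any particular system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Rep(Enum):
    """Repetition types; each value is the annotation appended to a field name."""
    REQUIRED = ""            # no annotation
    OPTIONAL = "?"           # 0 or 1 times
    REPEATED = "*"           # 0 or more times
    REQUIRED_REPEATED = "+"  # 1 or more times


@dataclass
class Field:
    """A field of a group type; a field with no children is a leaf."""
    name: str
    rep: Rep = Rep.REQUIRED
    children: List["Field"] = field(default_factory=list)

    def label(self) -> str:
        """Annotated label, e.g. 'Service+'."""
        return self.name + self.rep.value


def allowed(rep: Rep, count: int) -> bool:
    """Does `count` occurrences within the parent group respect `rep`?"""
    if rep is Rep.REQUIRED:
        return count == 1
    if rep is Rep.OPTIONAL:
        return count in (0, 1)
    if rep is Rep.REPEATED:
        return count >= 0
    return count >= 1  # REQUIRED_REPEATED


service = Field("Service", Rep.REQUIRED_REPEATED, [Field("Id"), Field("Type")])
print(service.label())  # Service+
```

Keeping the annotation as the enum value makes the annotated label a simple concatenation, and de-annotating a node amounts to reading `name` alone.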

Now a tree-schema [2, 3] is formally defined as a group type with 
no parent. It is easy to see that this definition of a tree-schema can 
be actually described by a tree as follows: 


Definition 2.1. (tree-schema) A tree-schema S of a table T is a
tree with labeled nodes such that

• each non-leaf node (called intermediate node) is a group and
its children are its attributes,
• each leaf node is associated with a primitive data type,
• each node (either intermediate or leaf) is associated with a
repetition constraint in repTypes, and
• the root node is labeled by the name T of the table.


We define an instance of a tree-schema S. Considering a subtree 
s of S, we denote as dummy s the tree constructed from s by de- 
annotating all the annotated nodes of s and adding to each leaf a 
single child which is labeled by the NULL-value. 


Definition 2.2. (tree-instance) Let S be a tree-schema and t be
a tree that is constructed from S by recursively replacing, from top
to bottom, each subtree $s_N$ rooted at an annotated node $N^{\sigma}$, where
$\sigma$ is an annotation, with

• either a dummy $s_N$ or k $s_N^{\not\sigma}$-subtrees, if $\sigma = *$ (repeated),
• either a dummy $s_N$ or a single $s_N^{\not\sigma}$-subtree, if $\sigma = ?$ (optional),
• k $s_N^{\not\sigma}$-subtrees, if $\sigma = +$ (repeated and required),

where $k \geq 1$ and $s_N^{\not\sigma}$ is constructed from $s_N$ by de-annotating
only its root. Then, for each non-NULL leaf N of t, we add to N
a single child which is labeled by a value of type that matches the
primitive type of N. The tree t is a tree-record of S. An instance of
S, called tree-instance, is a multiset of tree-records.


Example 2.3. Consider the table Booking with schema S depicted
in Figure 1. At this stage, we ignore the r-labeled edges, which
will be defined in the next section. The Booking table stores data
related to reservations; each record in the table represents a single
reservation. As we see in S, the Booking group includes a repeated
and required Service field (i.e., Service+) whose reachability path
is Booking.Service; i.e., each booking-record includes one or more

Note that a repeated field can be thought of as an array of elements (repeated types) 
in JSON structures. 


IDEAS 2021: the 25th anniversary 




services booked by the customer. The Type field describes the service
type, is mandatory (i.e., required), and takes values from the
set {accommodation, transfers, excursions}. The Booking and Service
groups could include additional fields, such as the date the reservation
was booked and the start and end dates of the service, which are ignored here due
to space limitations. Figure 2 illustrates a tree-record of S.


2.1 Additional concepts 


In the rest of this section we give a few useful definitions. Each
non-required node is called annotated node. When we de-annotate
a node, we remove the repetition symbol from its label, if it is
annotated. lb(N) represents the de-annotated label of a node N in
a schema. Considering the nodes $N_i$, $N_j$ of a schema S, such that
$N_j$ is a descendant of $N_i$, we denote by $N_i.N_{i+1}.\,\cdots\,.N_j$ the path
between $N_i$ and $N_j$ in S. The path of de-annotated labels between
the root and a node N of a schema S is called reachability path of
N. We omit a prefix in the reachability path of a node if we can still
identify the node through the remaining path. In such cases, we say
that the reachability path is given in a short form. To distinguish
the complete form of a reachability path (i.e., the one given without
omitting any prefix) from its short forms, we refer to the complete
form as the full reachability path.
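As a rough illustration of short and full reachability paths, the sketch below enumerates the full paths of a small schema and expands a short form when it identifies exactly one node. The schema fragment is an assumed excerpt of the Booking schema (Figure 1 is not reproduced here), labels are taken as already de-annotated, and the helper names `full_paths` and `resolve` are hypothetical.

```python
def full_paths(schema, prefix=()):
    """Return the full reachability path of every node, as dotted strings."""
    paths = []
    for label, children in schema.items():
        path = prefix + (label,)
        paths.append(".".join(path))
        paths.extend(full_paths(children, path))
    return paths


def resolve(short, paths):
    """Expand a short form to the unique full path it identifies."""
    matches = [p for p in paths if p == short or p.endswith("." + short)]
    if len(matches) != 1:
        raise ValueError(f"{short!r} matches {len(matches)} nodes")
    return matches[0]


# Assumed fragment of the Booking schema; an empty dict marks a leaf.
SCHEMA = {
    "Booking": {
        "Voucher": {},
        "Service": {"Id": {}, "Type": {}},
        "Passenger": {"Id": {}, "Name": {}},
    }
}
PATHS = full_paths(SCHEMA)

print(resolve("Service.Id", PATHS))  # Booking.Service.Id
print(resolve("Name", PATHS))        # Booking.Passenger.Name
```

In this fragment the short form `Id` alone is rejected, because both `Booking.Service.Id` and `Booking.Passenger.Id` end with it; a longer suffix is needed to identify the node.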

In the figures, all the dummy subtrees are ignored. The reacha- 
bility path of a node N in a tree-record is similarly defined as the 
path from the root of the tree-record to N. We assume that each 
node of both tree-schema and tree-record has a unique virtual id. 

We now define an instantiation in a multiset rather than in a
set notion, which describes a mapping from the tree-schema to
each tree-record in a tree-instance. Let S be a schema and t be a
tree-record in an instance of S. Since each node of S is replaced
by one or more nodes in t, there is at least one mapping $\mu$, called
instantiation, from the node ids of S to the node ids of t, such that,
ignoring the annotations in S, both the de-annotated labels and
the reachability paths of the mapped nodes match. The subtree
of t which is rooted at the node $\mu(N)$ is an instance of the node
N, where N is a node of S. If N is a leaf, an instance of N is the
single-value child of $\mu(N)$.

We say that two subtrees $s_1$, $s_2$ of a tree-record are isomorphic
if there is a bijective mapping h from $s_1$ to $s_2$ such that the
de-annotated labels of the mapped nodes match. We say that $s_1$ and $s_2$
of t are equal, denoted $s_1 = s_2$, if they are isomorphic and the ids of
the mapped nodes are equal.


3 WITHIN-RECORD REFERENTIAL 
CONSTRAINTS 


In this section, we define the concepts of within-record identity
and referential constraints for a tree-schema. Due to repetition
constraints, we might have fields that uniquely identify other fields
(or subtrees) in each record, but not the record itself [2, 6]. In
Example 2.3, each service has an identifier which is unique for each
service within a reservation, but not unique across all the reservations
in the Booking table. To support this type of constraint in a
tree-record data model, we define the concept of identity constraint
with respect to a group.


Definition 3.1. (identity constraint) Let S be a tree-schema
with root $R_0$, D be a tree-instance of S, and N, I be nodes of S such
that N is intermediate and I is a descendant of N. Suppose that M
is the parent of N, if $N \neq R_0$; otherwise, $M = N = R_0$. An identity
constraint with respect to N is an expression of the form $I \xrightarrow{i} N$,
such that I and all the descendants of I in S are required. We say
that $I \xrightarrow{i} N$ is satisfied in D if for each $t \in D$ and for each instance
$t_M$ of M in t, there are not two isomorphic instances of I in $t_M$. I is
called identifier and N is the range group of I.

In the tree representation of a schema, for each $I \xrightarrow{i} N$, we use
the symbol # to annotate the identifier I (i.e., I#). We also use a
special, dotted edge (I, N), called identity edge, to illustrate the range
group N of I. If N is the parent of I, we omit such an edge, for
simplicity.

Figure 1: Booking Schema - Tree-record model with references


Example 3.2. Continuing Example 2.3, each reservation-record
includes a list of passengers which is given by the field Passenger+.
The Passenger group includes the following 3 fields:

• Passenger.Id,
• Passenger.Name,
• Passenger.Location_id,

where the last one is optional for each passenger. Notice also
that there is the identity constraint Passenger.Id $\xrightarrow{i}$ Passenger,
which means that the field Passenger.Id uniquely identifies the
Passenger. Since the range group of Passenger.Id is its parent,
we ignore the corresponding identity edge. In Figure 2, we can see
a tree-record that satisfies this constraint, since each Passenger
instance has a unique Id.

To see the impact of the range group, let us compare the following
two constraints: Route $\xrightarrow{i}$ Transfer and Route $\xrightarrow{i}$ Service. The
field Route in the former case (i.e., the one illustrated in Figure 1) is
a composite identifier of its parent and consists of two location ids;
i.e., the combination of From and To locations uniquely identifies
the transfer instances within each service, but not across all the
services of the booking. On the other hand, setting the range group
of the Route to Service (i.e., the latter constraint), the combination
of From and To locations is unique across all the transfer services
in each booking-record.
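The difference between the two range groups can be sketched as a small check. Everything below is an assumption made for illustration: the record is a hypothetical booking encoded as nested dictionaries (repeated fields as lists), Transfer is assumed to be repeated within Service, and isomorphism of Route instances is approximated by equality of their (From, To) id pairs.

```python
def distinct(values):
    """True if no value occurs twice."""
    values = list(values)
    return len(values) == len(set(values))


def route(transfer):
    """The composite identifier Route, as a (From, To) pair of location ids."""
    r = transfer["Route"]
    return (r["From_Location_id"], r["To_Location_id"])


# Hypothetical booking: two services, each with one transfer over the same route.
booking = {
    "Service": [
        {"Id": 1, "Transfer": [{"Route": {"From_Location_id": 1, "To_Location_id": 2}}]},
        {"Id": 2, "Transfer": [{"Route": {"From_Location_id": 1, "To_Location_id": 2}}]},
    ]
}

# Route -i-> Transfer: M is the parent of Transfer (Service), so uniqueness
# is checked separately inside each Service instance.
within_each_service = all(
    distinct(route(t) for t in s["Transfer"]) for s in booking["Service"]
)

# Route -i-> Service: M is the parent of Service (Booking), so uniqueness
# is checked across all transfers of the booking.
across_booking = distinct(
    route(t) for s in booking["Service"] for t in s["Transfer"]
)

print(within_each_service, across_booking)  # True False
```

The same record therefore satisfies the first constraint and violates the second, which is exactly the effect of widening the range group.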


We now define the concept of referential constraint (or, simply
reference), which intuitively links the values of two fields. In essence,
the concept of reference is similar to the foreign key in relational
databases, but, here, it is applied within each record.


Definition 3.3. (referential constraint) Let I, N and R be nodes
of a tree-schema S such that $I \xrightarrow{i} N$, R is not a descendant of N, and
I, R have the same data type. A referential constraint is an expression
of the form $R \xrightarrow{r} I$. A tree-instance D of S satisfies the constraint
$R \xrightarrow{r} I$ if for each tree-record $t \in D$ the following is true: for each
instance $t_{LCA}$ of the lowest common ancestor (LCA) of R and N in
t, each instance of R in t is isomorphic to an instance of I in $t_{LCA}$.
If I is a leaf, then $R \xrightarrow{r} I$ is called simple.


If we have $R \xrightarrow{r} I$, we say that R (called referrer) refers to I (called
referent). To represent the constraint $R \xrightarrow{r} I$ in a tree-schema S, we
add a special (dashed) edge (I, R), called reference edge, which is
labeled by r. Let C be the set of identity and referential constraints
over S. Consider now the tree B given by (1) ignoring all the reference
and identity edges, and (2) de-annotating the identifier nodes.
We say that a collection D is a tree-instance of the tree-schema S
in the presence of C if D is a tree-instance of B and D satisfies all
the constraints in C.


Example 3.4. Continuing Example 3.2, we can see that the
schema S depicted in Figure 1 includes two referrers of Passenger.Id:
Service.Passenger_Id and Transfer.Passenger_Id store the ids of
the passengers that booked each service and the ids of the passengers
taking each transfer, respectively. Furthermore, Location.Id
is a referent in four references, while the composite identifier
Route consists of two fields that both refer to the Location.Id
identifier.

Consider now the tree-record t of S which is depicted in Figure 2.
It is easy to see that this reservation includes 3 services booked for
two passengers. The first service is booked for the first passenger,
while the services with ids 2 and 3 are taken by both passengers. In
each service, there is a list of passenger ids representing the passengers
taking each service. Those fields refer to the corresponding
passengers in the passenger list of the booking; appropriate reference
edges illustrate the references in t.
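A simple reference such as Service.Passenger_Id referring to Passenger.Id can be checked per record with a few lines. The sketch below is an illustration under assumptions: the record is a hypothetical encoding that mirrors the description above (3 services, 2 passengers), leaf instances are compared by value, and the lowest common ancestor of referrer and range group is taken to be the record root.

```python
def satisfies_reference(referrer_values, referent_values):
    """Every referrer value must equal some referent (identifier) value."""
    referents = set(referent_values)
    return all(value in referents for value in referrer_values)


# Hypothetical record: service 1 is for passenger 1, services 2 and 3 for both.
booking = {
    "Passenger": [{"Id": 1, "Name": "A"}, {"Id": 2, "Name": "B"}],
    "Service": [
        {"Id": 1, "Passenger_Id": [1]},
        {"Id": 2, "Passenger_Id": [1, 2]},
        {"Id": 3, "Passenger_Id": [1, 2]},
    ],
}


def check_booking(record):
    """Check Service.Passenger_Id -r-> Passenger.Id inside one record."""
    referrers = [pid for s in record["Service"] for pid in s["Passenger_Id"]]
    referents = [p["Id"] for p in record["Passenger"]]
    return satisfies_reference(referrers, referents)


print(check_booking(booking))  # True
```

Because the constraint is within-record, the check never looks at passengers of other bookings; a passenger id that exists only in another record does not satisfy the reference.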


4 SQL-LIKE QUERY LANGUAGE 


We use the Select-From-Where-GroupBy expressions used in Dremel 
[3, 21] to query tables defined through a tree-schema. We refer to 
such a query language as Tree-SQL. 

In detail, a query Q is an expression over a table T with schema
S and tree-instance D in the presence of within-record constraints C. We
consider only simple references in C.³ The expression Q has the
following form, using the conventional SQL syntax [17]:
SELECT expr FROM T [WHERE cond] [GROUP BY grp],
where cond is a logical formula over fields of S and grp is a list
of grouping fields. cond, expr and grp are defined in terms of the
leaves of S. expr is a list of selected leaves of S followed by a list of
aggregations over the leaves of S. In the case that expr includes both
aggregated operators and fields that are not used by aggregations,
those fields should be present in the GROUP BY clause. Each leaf


³ Querying schemas having references to intermediate nodes is considered a topic for
future investigation.

Figure 2: Booking instance - Tree-record with references


node in Q is referred to through either its reachability path or a path
using reference and identity edges implied by the constraints in C.
To analyze the semantics of a query in more detail, we initially
ignore the references and identifiers.
We start with an example. Consider the following query Q over
the table Booking with schema S depicted in Figure 1:
SELECT Voucher, Destination, Operator.Name
FROM Booking
WHERE Operator.Country = 'GE';


When Q is applied on an instance D of S, it results in a relation, also
denoted Q(D), including a single tuple for each tree-record t in
D such that the instance of the Operator.Country field in t is
'GE'. Each tuple in Q(D) includes the voucher of the booking, the
destination and the name of the operator (if it exists; otherwise,
the NULL-value). For example, if D includes the tree-record illustrated
in Figure 2, then Q(D) includes the tuple (sONI1f FO, Greece,
NULL).

We give the semantics of the language via the concept of flatten- 
ing [2, 3] which is presented in Section 4.1. 


Observations and discussion: Although navigation languages (e.g.,
XPath, XQuery, JSONPath⁴) are used to query a single tree-structured
document, here, we focus on a combination of a simple navigation
language and an SQL-like language to query collections of
tree-structured data.

In this work, we do not consider joins, recursion, nested queries 
and within-aggregations [21], as well as operations that are used to 
build a tree-like structure at query-time, or as a result of the query 
(e.g., the json_build-like functions in PostgreSQL). 

In the previous example, we can see that the fields used in both 
SELECT and WHERE clauses do not have any repeated field in their 
reachability path. Querying the instances of such kind of fields is 
similar to querying a relation consisting of a column for each field. 
The tuples are constructed by assigning the single value of each 
field, in each record, to the corresponding column. The challenge 
comes up when a repeated node exists in the reachability path of 


⁴ JSONPath (2007). http://goessner.net/articles/JsonPath




a field used in the query; since such a field might have multiple 
instances in each tree-record. 


4.1 Flattening nested data 


In this section, we analyze the flattening operation applied on
tree-structured data. Flattening is a mapping applied on a tree-structured
table that translates the tree-records of the table to tuples in a
relation. By defining such a mapping, the semantics of Tree-SQL is
given by the conventional SQL semantics over the flattened relation
(i.e., the result of the flattening over the table). Initially, we consider
a tree-schema without referential and identity constraints.

Let S be a tree-schema of a table T and D be an instance of S,
such that there is not any reference defined in S. Suppose also that
$N_1, \dots, N_m$ are the leaves of S. The flattened relation of D, denoted
flatten(D), is a multiset given as follows:
$\mathit{flatten}(D) = \{\{ (lb(\mu(N_1)), \dots, lb(\mu(N_m))) \mid \mu \text{ is an instantiation of a tuple } t \in D \}\}$.
For each pair $(N_i, N_j)$, $\mu(N_i)$ and $\mu(N_j)$ belong to the same instance
of the lowest common ancestor of $N_i$ and $N_j$ in t. Considering now
a query Q over S and an instance D of S, we say that Q is evaluated
using full flattening, denoted Q(flatten(D)), if Q(D) is given by
evaluating Q over the relation flatten(D). It is worth noting here
that if S does not have any repeated field then the flattened relation
of D includes |D| tuples; otherwise, each record in D can produce
multiple tuples during flattening.


Example 4.1. Let T be a table with schema S depicted in
Figure 3(a), and instance D including only the tree-record depicted
in Figure 3(b). The flattened relation flatten(D) is $\{\{(V_1, V_2, V_4),
(V_1, V_3, V_4), (V_1, V_5, NULL)\}\}$. Consider now the query Q: SELECT
N1, N4 FROM T. Q typically applies a projection over the flattened
relation; hence, it results in three tuples; i.e., $\{\{(V_1, V_4), (V_1, V_4), (V_1,
NULL)\}\}$. However, we see that N4 in D has two instances, $V_4$ and
NULL. The evaluation of Q using full flattening is affected by the
repetition of N3.
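Full flattening can be sketched as a recursive cross product over sibling subtrees. The code below is a minimal illustration, not the authors' algorithm: Figure 3 is not reproduced here, so the schema shape (N1 required; N2 repeated, with a repeated child N3 and an optional child N4) is an assumption chosen to reproduce the tuples of Example 4.1, and the dictionary encoding of records is hypothetical.

```python
from itertools import product


def flatten(schema, node):
    """Flatten one (sub)tree into a list of tuples, one column per leaf.

    schema: dict mapping a de-annotated label to its sub-schema ({} for a leaf).
    node:   dict holding the instance; repeated fields are lists, and a
            missing field (or None) plays the role of a dummy subtree.
    """
    per_child = []
    for label, sub in schema.items():
        value = node.get(label) if node is not None else None
        if isinstance(value, list):
            instances = value
        elif value is None:
            instances = []
        else:
            instances = [value]
        if not sub:  # leaf: one single-column row per value, NULL if absent
            rows = [(v,) for v in instances] or [(None,)]
        else:        # group: rows of every instance, or an all-NULL dummy row
            rows = [r for inst in instances for r in flatten(sub, inst)]
            rows = rows or flatten(sub, None)
        per_child.append(rows)
    # Sibling subtrees combine as a cross product within the same parent instance.
    return [sum(combo, ()) for combo in product(*per_child)]


# Assumed shape of Figure 3 (children of the root T only).
SCHEMA = {"N1": {}, "N2": {"N3": {}, "N4": {}}}
RECORD = {"N1": "V1", "N2": [{"N3": ["V2", "V3"], "N4": "V4"}, {"N3": ["V5"]}]}

FULL = flatten(SCHEMA, RECORD)
PROJECTED = [(n1, n4) for n1, _n3, n4 in FULL]  # SELECT N1, N4 FROM T

print(FULL)
print(PROJECTED)
```

Under these assumptions `FULL` holds the three tuples of flatten(D), and the projection repeats (V1, V4) because N3 has two instances next to the same N4 value.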


Now we motivate the definition of relative flattening with an
example. Supposing the table T with schema and record depicted
in Figures 1 and 2, respectively, we want to find the total price
(i.e., sum(Service.Price)) for accommodation services. Although
the total price is 1,800, we can see that using full flattening, the
result of the query is 9,000, due to the repetition of the Passenger
and Location subtrees.

Figure 3: Flattening, out-of-range references and cycles

To avoid cases where the repetition of a field that is not used in
the query has an impact on the query result, we define the concept
of relative flattening. Let S be a tree-schema of a table T and D be
an instance of S, such that there is not any reference defined in S.
Consider also a query Q over T that uses a subset $L = \{N_1, \dots, N_k\}$
of the set of leaves of S, and the tree-schema $S_L$ constructed from S
by removing all the nodes except the ones included in the reachability
paths of the leaves in L. Then, we say that a query Q is
evaluated using relative flattening, denoted Q(flatten(D, Q)), if
Q(D) is given by evaluating Q over the relation:
$\mathit{flatten}(D, Q) = \{\{ (lb(\mu(N_1)), \dots, lb(\mu(N_k))) \mid \mu \text{ is an instantiation from the nodes of } S_L \text{ to the nodes of } t \in D \}\}$.

Continuing Example 4.1, we have that Q(flatten(D, Q)) =
$\{\{(V_1, V_4), (V_1, NULL)\}\}$.
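Relative flattening can be sketched as "prune, then flatten": first restrict the schema to the reachability paths of the leaves used by the query, then flatten the record against the pruned schema. The code is self-contained and illustrative only; the schema shape assumed for Figure 3, the dictionary encoding of records and the helper names `prune` and `relative_flatten` are assumptions, and paths are written relative to the table root.

```python
from itertools import product


def flatten(schema, node):
    """Cross-product flattening; repeated fields are lists, missing means NULL."""
    per_child = []
    for label, sub in schema.items():
        value = node.get(label) if node is not None else None
        if isinstance(value, list):
            instances = value
        elif value is None:
            instances = []
        else:
            instances = [value]
        if not sub:  # leaf
            rows = [(v,) for v in instances] or [(None,)]
        else:        # group
            rows = [r for inst in instances for r in flatten(sub, inst)]
            rows = rows or flatten(sub, None)
        per_child.append(rows)
    return [sum(combo, ()) for combo in product(*per_child)]


def prune(schema, leaf_paths):
    """Keep only the nodes lying on the given dotted leaf paths."""
    kept = {}
    for path in leaf_paths:
        source, target = schema, kept
        for label in path.split("."):
            source = source[label]  # KeyError if the path is not in the schema
            target = target.setdefault(label, {})
    return kept


def relative_flatten(schema, record, leaf_paths):
    """Flatten with respect to the paths of a query only.

    Columns follow the order of the pruned schema.
    """
    return flatten(prune(schema, leaf_paths), record)


# Assumed shape of Figure 3 (children of the root T only).
SCHEMA = {"N1": {}, "N2": {"N3": {}, "N4": {}}}
RECORD = {"N1": "V1", "N2": [{"N3": ["V2", "V3"], "N4": "V4"}, {"N3": ["V5"]}]}

RELATIVE = relative_flatten(SCHEMA, RECORD, ["N1", "N2.N4"])  # SELECT N1, N4 FROM T
FULL_PROJECTED = [(n1, n4) for n1, _n3, n4 in flatten(SCHEMA, RECORD)]

print(RELATIVE)        # two tuples
print(FULL_PROJECTED)  # three tuples
```

Since N3 is pruned away before flattening, its repetition no longer duplicates the (V1, V4) tuple, while the set of distinct tuples is the same as with full flattening.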


PROPOSITION 4.2. Consider a query Q over a tree-schema S such
that Q does not apply any aggregation. Then, for every instance D of
S, the following are true:

(1) there is a tuple r in Q(flatten(D, Q)) if and only if there is a
tuple r in Q(flatten(D)),
(2) $|Q(\mathit{flatten}(D, Q))| \leq |Q(\mathit{flatten}(D))|$.


As we discussed previously, to flatten a tree-instance with respect
to a given query (i.e., using relative flattening), we use all the paths
included in the query expression, regardless of whether the paths
are included in the select-clause or not. Such a remark is critical,
due to our familiarity with conventional SQL semantics over the
relational data model. In tree-record collections, which include
repeated fields, even the paths of conditions in the where-clause
that are always true might significantly affect the result of a query,
and specifically the multiplicity of the resulting tuples. To see this,
consider, for example, the following query Q over the schema S
defined in Figure 1.


SELECT Voucher, Destination 
FROM Booking 
WHERE Service.Id = Service.Id; 


It is easy to see that the condition in the where-clause is always true.
Suppose now the query Q′ which is given from Q by removing the
where-clause. Although a conventional SQL user would expect that
Q(D) = Q′(D), for every tree-instance D of S, this is not true. In
particular, if D = {{t}}, where t is the tree-record illustrated in
Figure 2, we can easily see that Q′(D) = {{(sONI1f FO, Greece)}},
while Q(D) includes the tuple (sONI1f FO, Greece) 3 times, since
Service is a repeated field used by Q and the field Service.Id has 3
instances in t.


4.2 Flattening with references 


In the previous section, we ignored the presence of constraints 
when we explained how to use flattening to answer a Tree-SQL 
query. Here, we show how we take advantage of the constraints to 
extend the query semantics based on the relative flattening. 

Let us start our analysis by looking at the schema S in Figure 1.
Let D be an instance of S. Suppose now that we want to find,
for all the transfer services in D, their vouchers, along with the
following transfer information: vehicle of each transfer and the
route expressed as a combination of From and To cities. Note that
this query cannot be answered based on the query semantics defined
in the previous section⁵, since the city of each location does not
belong to the same Route subtree. Taking into account, however,
the following constraints, it is easy to see that intuitively such a
query could be answered.


From_Location_id $\xrightarrow{r}$ Location.Id
To_Location_id $\xrightarrow{r}$ Location.Id
Location.Id $\xrightarrow{i}$ Location

To see this, we can initially search for the voucher, vehicle, and
ids of the From and To Locations for each transfer service within
all the bookings. Then, for each id of the From and To Locations,
we look at the corresponding Location list of the same record and
identify the corresponding cities. To capture such cases and use the
identity and referential constraints, we initially extend the notation
of Tree-SQL as follows. Apart from the reachability paths of
the leaves that can be used in SELECT, WHERE and GROUP BY
clauses, if there are constraints $R \xrightarrow{r} I$ and $I \xrightarrow{i} G$ over a schema S,
we can use paths of the form:

[pathToR].R.I.G.[pathToL],

where [pathToR] is the full reachability path of R, L is a leaf
which is a descendant of G, and [pathToL] is the path of de-annotated
labels from G to L in S. We call such paths (full) reference paths.
Similarly to reachability paths, we can also use a shortened form of
the reference paths by using a short form [shortPathToR] of the
reachability path [pathToR] and omitting the nodes I, G (which
are implied by the constraints); i.e., its short form is

[shortPathToR].R.[pathToL].


Hence, the query $Q_{tr}$ answering the previous question is:
SELECT Voucher, Vehicle, Route.From_Location_id.City,
Route.To_Location_id.City
FROM Booking WHERE Service.Type = 'transfer';
Intuitively, navigating through identity and reference edges, the 
leaves of G become accessible from R. For example, in the schema S 
in Figure 1, the leaves of the group Location are accessible through 
both From_Location_id and To_Location_id. 

To formally capture queries using references, we extend the 
relative flattening presented in the previous section as follows. 


Definition 4.3. Let S be a tree-schema of a table T, D be an 
instance of S, and C be a set of identity and referential constraints 


⁵ If the data is structured as in Figure 1, such a query cannot be answered using
Select-From-Where-GroupBy queries in Dremel, as well.
satisfied in D. Consider also a query Q over T that uses a set of
leaves $\mathcal{L} = \{N_1, \dots, N_k, L_1, \dots, L_m\}$ of S so that for each $L_i$ in $\mathcal{L}$
there are constraints $R_i \xrightarrow{r} I_i$, $I_i \xrightarrow{i} G_i$ in C, where $G_i$ is an ancestor
of $L_i$. Suppose the construction of the following schemas from S:

• we construct a single tree-schema $S_{\mathcal{L}}$ from S by keeping only
the reachability paths (without de-annotating any node) of
the leaves in $\{N_1, \dots, N_k, R_1, \dots, R_m\}$, and
• for each unique $G_i$ so that there is at least one $L_i \in \mathcal{L}$, we
construct a tree-schema $S_{G_i}$ which includes only the reachability
paths of $R_i$, $I_i$ and the leaves of $G_i$ that are included in
$\{L_1, \dots, L_m\}$.

Both $S_{\mathcal{L}}$ and $S_{G_i}$ keep the node ids from S. A query Q is evaluated
in the presence of the constraints in C, denoted Q(flatten(D, Q, C)),
if Q(D) is given by evaluating Q over the relation:
$\mathit{flatten}(D, Q, C) = \{\{ (lb(\mu(N_1)), \dots, lb(\mu(N_k)), lb(\mu_1(L_1)), \dots, lb(\mu_m(L_m))) \mid$

• $\mu$ is an instantiation from the nodes of $S_{\mathcal{L}}$ to the nodes of
$t \in D$,
• each $\mu_i$ is an instantiation from the nodes of $S_{G_i}$ to the nodes
of t,
• for every two $L_i$, $L_j$ s.t. $G_i = G_j$ and $R_i = R_j$, we have that
$\mu_i = \mu_j$, and
• for each i, we have that $\mu(R_i) = \mu_i(R_i)$ and $lb(\mu_i(R_i)) =
lb(\mu_i(I_i))$ $\}\}$.


Posing the query defined above on an instance D including the tree- 
record depicted in Figure 2, we have two instantiations, each of 
which maps to a different instance of the Transfer subtree. For each 
such instantiation, there is a single instantiation to a Location 
instance so that the referrer value equals the Location.Id value. The 
result is: {(sONI1f FO, Train, Athens, Chalcis), (sONI1f FO, 
Bus, Chalcis, Kymi)}. If we replace Route.To_Location_id.City with 
the field Location.City, the reference from To_Location_id to 
Location.Id is not used; hence, the result includes 6 tuples computed 
by combining the 2 cities of the from-location, Athens and Chalcis, 
with all the available instances of Location.City. 
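
The difference between the two results can be sketched in Python. This is only an illustration under stated assumptions: the nested-dictionary layout below is a simplified, hypothetical stand-in for the tree-record of Figure 2, the voucher value "V1" and the location ids are made up, and the two helper functions are not part of the paper's formalism.

```python
from itertools import product

# Hypothetical, simplified tree-record; only the field names come from the query.
booking = {
    "Voucher": "V1",  # made-up value
    "Location": [
        {"Id": 1, "City": "Athens"},
        {"Id": 2, "City": "Chalcis"},
        {"Id": 3, "City": "Kymi"},
    ],
    "Service": [
        {"Type": "transfer", "Vehicle": "Train",
         "Route": {"From_Location_id": 1, "To_Location_id": 2}},
        {"Type": "transfer", "Vehicle": "Bus",
         "Route": {"From_Location_id": 2, "To_Location_id": 3}},
    ],
}


def with_references(rec):
    """Follow both referrers to the Location whose Id equals the referrer value."""
    city = {loc["Id"]: loc["City"] for loc in rec["Location"]}
    return [
        (rec["Voucher"], s["Vehicle"],
         city[s["Route"]["From_Location_id"]],
         city[s["Route"]["To_Location_id"]])
        for s in rec["Service"] if s["Type"] == "transfer"
    ]


def without_to_reference(rec):
    """Select Location.City directly: each transfer pairs with every Location."""
    city = {loc["Id"]: loc["City"] for loc in rec["Location"]}
    return [
        (rec["Voucher"], s["Vehicle"],
         city[s["Route"]["From_Location_id"]], loc["City"])
        for s, loc in product(rec["Service"], rec["Location"])
        if s["Type"] == "transfer"
    ]


print(len(with_references(booking)), len(without_to_reference(booking)))
```

With three Location instances and two transfers, the first function returns 2 tuples and the second returns 6, which mirrors the counts discussed above.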


Example 4.4. Consider the table T with tree-schema S and 
instance D = {{t}}, where S and t are depicted in Figure 4. Notice 
that the virtual id of each node is depicted next to the node in each 
tree. Let Q be a query over T defined as follows: 


SELECT N6, N5, N5.N3, N8.N3 
FROM T 
WHERE N9 > 2; 


The result of the query Q when it is posed on D (i.e., Q(D)) is 
illustrated in Table 2a (the first column is just a sequential number 
that is not included in the result). To compute Q(D), we initially 
find the paths used in Q. These paths are summarized in Table 1, 
where both short and full forms of the paths (reachability and ref- 
erence paths) are included; for simplicity, we omit the annotations. 


Following the flattening definition presented earlier in this 
section, we first construct the schema $S_{\mathcal{L}}$ depicted in 
Figure 4, where $\mathcal{L}$ = {N6, N5, N9, N8}. The node N8 is included 
in the list since N8.N3 is included in the select-clause of Q; this 
path is given due to the reference N8 → N2 and the identity 
constraint N2 → N1. N5 is included in $\mathcal{L}$ due to the paths 
N5.N3 and N5 in the select-clause of Q. Then, 
# | Short path | Full path        | Path type         | Clause 
1 | N6         | T.N4.N6          | Reachability path | SELECT 
2 | N5         | T.N4.N5          | Reachability path | SELECT 
3 | N5.N3      | T.N4.N5.N2.N1.N3 | Reference path    | SELECT 
4 | N8.N3      | T.N7.N8.N2.N1.N3 | Reference path    | SELECT 
5 | N9         | T.N9             | Reachability path | WHERE 


Table 1: Paths of Q 
Figure 4: Schema construction w.r.t. a query 


we construct the schemas $S_1$ and $S_2$ for the referrers N8 and N5, 
respectively, which are also depicted in Figure 4. Finding now the 
instantiations of the schemas $S_1$, $S_2$ and $S_{\mathcal{L}}$ over t, 
we get the result Q(D). Table 2b includes the instantiations for each 
schema that are used to compute the answer (#1) in Table 2a; e.g., the 
first instantiation maps the node N9 of the schema $S_{\mathcal{L}}$ to 
the node with id 17 in t. Next, we join these instantiations so that 
the instantiation of $S_{\mathcal{L}}$ joins with the instantiations of 
both $S_1$ and $S_2$ on the nodes N8 and N5 (included in both schemas), 
respectively. The result is given by projecting the join result on the 
nodes included in the select-clause of Q, i.e., N6, N5, N3 (through 
N5), N3 (through N8), and getting their primitive values in t. 


Schema              | Node ids in t 
$S_{\mathcal{L}}$   | 17, 7, 8, 10, 14, 15 
$S_1$               | 4, 5, 6, 14, 15 
$S_2$               | 1, 2, 3, 7, 8 


(b) Instantiations over t that produce the first record in Table 2a. 


Table 2: Result of Q and instantiations of the schemas $S_1$, $S_2$ 
and $S_{\mathcal{L}}$ in Figure 4. 
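
The join-and-project step of Example 4.4 can be traced with a short Python sketch. The ids are those listed in Table 2b; the assignment of each id to a schema node (and the node names N6, N8 and N9 themselves) is an assumption, read off the embedding h given in Example 5.5, and the function names are hypothetical.

```python
# Instantiations copied from Table 2b. Which schema node each id belongs to is
# an assumption, inferred from the embedding h listed in Example 5.5.
inst_main = {"N9": 17, "N4": 7, "N5": 8, "N6": 10, "N7": 14, "N8": 15}
inst_s1 = {"N1": 4, "N2": 5, "N3": 6, "N7": 14, "N8": 15}  # schema for referrer N8
inst_s2 = {"N1": 1, "N2": 2, "N3": 3, "N4": 7, "N5": 8}    # schema for referrer N5


def agree_on(left, right, node):
    """Two instantiations join when they map the shared node to the same id."""
    return left[node] == right[node]


def projected_node_ids():
    """Join the three instantiations and project on the select-clause nodes."""
    if not (agree_on(inst_main, inst_s1, "N8")
            and agree_on(inst_main, inst_s2, "N5")):
        return None
    # SELECT N6, N5, N5.N3, N8.N3
    return (inst_main["N6"], inst_main["N5"], inst_s2["N3"], inst_s1["N3"])


print(projected_node_ids())  # ids of the nodes of t whose values form the answer
```

The sketch stops at node ids; the answer tuple is obtained by reading the primitive values of these nodes in t.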


4.3 Well-defined referential constraints 


When we define a reference constraint we should also make sure 
that it is well-defined in the following sense. If the reference is 
within-range then we say that we have a well-defined reference. 
Definition 4.5. Let S be a tree-schema, and let the reference 
$R \to I$ and the identity constraint $I \to G$ be two constraints 
over S. Let L be the lowest common ancestor of G and R. Then, we say 
that the reference $R \to I$ is out-of-range if there is at least one 
repeated group on the path from L to G; otherwise, the reference is 
within-range. 


In the following example, we explain what goes wrong when a 
reference constraint is out-of-range. 


Example 4.6. Consider the tree-schema S depicted in Figure 3(c) 
and an instance D of S including the tree-record in Figure 3(d). As 
we can see, there are 2 referrers, R1 and R2, which both refer to 
the identifier I. The range group of I is the node N3. Consider the 
queries Q1 and Q2 selecting only the fields R1 and R2, respectively. 
Note that the result Q2(D) includes the value 1 twice, while 
Q1(D) includes the value 1 once. Let now Q1' and Q2' be the queries 
selecting the paths R1.N and R2.N, respectively. We can see that 
both Q1'(D) and Q2'(D) include the tuples (3) and (5). Hence, 
when we use the reference from R2, the number of tuples in the 
result remains the same. Using, however, the reference from R1, the 
number of tuples in the result increases. This is because it is not 
clear which is the instance of I that the instance of R1 refers to. 
This property is captured by the following definition and proposition. 


PROPOSITION 4.7. Let C be a reference $R \to I$ over a tree-schema 
S, and t be a record of a tree-instance D of S satisfying the constraint. 
If C is within-range, then for each instance of R in t, there is a single 
instance of I in t. 


The property described in Proposition 4.7 is very important 
for defining referential constraints: if out-of-range constraints 
are set up, the queries using the references might not compute the 
“expected” results, as demonstrated in Example 4.6. 
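
As an illustration of the within-range test, the following Python sketch checks a reference against a schema given as a child-to-parent map. Everything here is hypothetical: the toy schema, the node names and the function names are not taken from the paper. The sketch also assumes that "on the path from L to G" means the nodes strictly between L and G, which is one possible reading of the definition above.

```python
def path_to_root(node, parent):
    """Return [node, parent(node), ..., root] for a child-to-parent map."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def is_within_range(referrer, group, parent, repeated):
    """True if no repeated group lies strictly between LCA(group, referrer)
    and group (assumed reading of the definition)."""
    ancestors_of_referrer = set(path_to_root(referrer, parent))
    between = []
    for node in path_to_root(group, parent)[1:]:  # skip the group itself
        if node in ancestors_of_referrer:         # reached the lowest common ancestor
            break
        between.append(node)
    return not any(node in repeated for node in between)


# Hypothetical schema: T is the root, A is a repeated group that contains the
# group G (identified by I) and the referrer R2; the referrer R1 hangs off T.
parent = {"A": "T", "G": "A", "I": "G", "R1": "T", "R2": "A"}
repeated = {"A", "G"}

print(is_within_range("R2", "G", parent, repeated))  # LCA is A
print(is_within_range("R1", "G", parent, repeated))  # LCA is T, A lies in between
```

In this toy schema the reference from R2 is within-range, while the reference from R1 crosses the repeated group A and is therefore out-of-range.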


5 ANSWERING QUERIES USING 
TREE-EXPANSIONS 


In this section, we present an alternative method for computing the 
result of a query, which is based on the concepts of tree-expansion 
and embedding. In the next subsection, we first present this method 
when there is only one reference in each path. Then, we extend in 
Section 5.2 to multiple references. 


5.1 Tree-expansion for a single reference 


Let us initially define the concept of tree-expansion, which typi- 
cally describes a query tree constructed from the tree-schema, the 
constraints and the query expression. 


Definition 5.1. Let C be a set of constraints over a schema S of 
a table T and Q be a query over T. We consider that all the paths 
(both reachability and reference) in Q are given in their full form. 
We construct a tree, called the tree-expansion of Q and denoted as 
Gr(Q), from all the paths occurring in Q as follows: 


(1) Each path occurring in Q is a path of Gr(Q), and there is no 
path in Gr(Q) that does not occur in Q. 

(2) Each node in Gr(Q) has a distinct reachability path. 

(3) Each leaf node has a child-node labeled by a distinct variable. 
We consider that each node of Gr(Q) is associated with a unique 
identifier, not necessarily the same as the one given in S. Typically, 
the tree-expansion of a query Q over a schema S is constructed 
as follows. We collect all the paths of Q, in their full form, and 
construct a tree from these paths so that all common nodes of the 
paths are merged. Note here that there might be multiple, different 
reference paths in Q referring to the same node in S. In such a case, 
we do not merge the suffixes of those reference paths in Gr(Q), but 
keep them separated°. Gr(Q) can also be constructed from S by 
keeping the paths occurring in Q (and de-annotating each node), 
and removing all the nodes and edges of S that are not included in 
any path of Q. 


Example 5.2. Consider the tree-schema S and the query Q over 
S, both introduced in Example 4.4. To construct the tree-expansion 
Gr(Q) of Q, we initially take the full form of all the paths in Q. 
These paths are listed in Table 1. 

Combining these paths, we construct the tree-expansion Gr(Q) 
of Q, which is illustrated in Figure 5. As we can see, the first three 
paths in Table 1 have the same prefix, which is given by the sub-path 
T.N4. These paths are merged and share the same sub-path 
T.N4 in Gr(Q). In addition, although there is a single subtree in S 
that is rooted at the node N1*, Gr(Q) has two paths N2.N1.N3, one 
rooted at N5 and one rooted at N8. These two paths are constructed 
from the reference paths (#3) and (#4) in Table 1. One could say 
that the two paths N2.N1.N3 could be merged. By merging these 
paths, however, we would miss records in the result, as we will see 
later in this section. 
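
The merging of paths on their common prefixes can be sketched as a prefix tree in Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the variable children of condition (3) are omitted, and the node names in the full paths (in particular N6, N8 and N9) follow one reading of Table 1.

```python
def tree_expansion(full_paths):
    """Merge dotted full paths on their common prefixes into a nested dict."""
    root = {}
    for path in full_paths:
        node = root
        for label in path.split("."):
            node = node.setdefault(label, {})  # reuse a node only along a shared prefix
    return root


def count_nodes(tree):
    """Number of nodes in the nested-dict tree."""
    return sum(1 + count_nodes(child) for child in tree.values())


# Full paths of Q as read from Table 1 (node names are an assumed reading).
paths = [
    "T.N4.N6",
    "T.N4.N5",
    "T.N4.N5.N2.N1.N3",
    "T.N7.N8.N2.N1.N3",
    "T.N9",
]

gr = tree_expansion(paths)
print(count_nodes(gr))
print(sorted(gr["T"]))
```

Because nodes are merged only along shared prefixes, the suffix N2.N1.N3 appears twice in the result, once below N5 and once below N8, as the example describes.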


Figure 5: Tree-expansion Gr(Q) of Q in Example 4.4. 


Let us now see how the tree-expansion can be used to alterna- 
tively find the flattened relation, and consequently answer a query. 
To see this, we define the concept of embedding, which, in essence, 
describes a mapping from a tree-expansion to a tree-record. 


Definition 5.3. Let Q be a query defined on a table T with schema 
S and instance D. Suppose a mapping h from the nodes of Gr(Q) 
to the nodes of $t \in D$ so that 
• for each node n of Gr(Q), if n is not a variable then 
$lb(n) = lb(h(n))$, 
• for each variable v of Gr(Q), h(v) is a leaf of t, 
• $h(r) = r_t$, where r and $r_t$ are the roots of Gr(Q) and t, 
respectively, and 
• for each edge $(n_1, n_2)$ of Gr(Q), there is an edge 
$(h(n_1), h(n_2))$ in t. 
Then, h is called an embedding from Q to t. The multiset 
$\{\{ (lb(h(X_1)), \dots, lb(h(X_k))) \mid h$ is an embedding from Q to a 
tree-record $t \in D$ and $X_1, \dots, X_k$ are the variables of 
Gr(Q) $\}\}$ is called the relational result of Gr(Q) over D and is 
denoted as Gr(Q, D, C). 


°Otherwise, we would not have a tree but a graph. 


The following theorem shows that evaluating Q over the rela- 
tional result of Gr(Q) equals the evaluation of Q over the flattened 
relation. 


THEOREM 5.4. Let Q be a query over a table with schema S and a 
set of constraints C over S. Then, for each instance D of S we have that 
flatten(D, Q, C) = Gr(Q, D, C) and 
Q(D) = Q(flatten(D, Q, C)) = Q(Gr(Q, D, C)). 


Example 5.5. Continuing the Example 5.2, let us see how the 
result of Q described in Example 4.4 is computed using the tree- 
expansion of Q. Here, we focus on describing how the first answer 
included in Table 2a is computed. 

We recall that the tree-expansion Gr(Q) of Q is depicted in 
Figure 5. Computing now the embeddings from Gr(Q) to t, we are 
able to find the embedding h illustrated in Figure 6 (for simplicity, 
we illustrate only the mapping of the nodes included in the 
SELECT-clause), where the root of Gr(Q) maps to the root of t (i.e., 
h(0) = 0), and 


h(1) = 17,  h(4) = 2,  h(9) = 10,   h(13) = 5, 
h(2) = 7,   h(5) = 1,  h(11) = 14,  h(14) = 4, 
h(3) = 8,   h(6) = 3,  h(12) = 15,  h(15) = 6. 


As we can see, the path T.N4.N5.N2.N1.N3 (i.e., 0.2.3.4.5.6, in terms 
of node ids) of Gr(Q) maps on the path 0.7.8.2.1.3 of t (given in 
terms of node ids), where the edge (8, 2) is a reference edge and the 
edge (2, 1) is an identity edge. Similarly, the path 0.11.12.13.14.15 
(i.e., T.N7.N8.N2.N1.N3) of Gr(Q), given using node ids, maps on 
the path 0.14.15.5.4.6 of t, where the edge (15, 5) is a reference 
edge and the edge (5, 4) is an identity edge. Using the embedding h, 
the tuple (X1, X2, X3, X4, X5) maps on the tuple of primitive values 
(5, 1, 3, 4, 3), which is the same answer computed in Example 4.4 and 
gives the first answer in Table 2a. In Figure 6, the primitive values 
in t that are mapped by the variables of Gr(Q) and included 
in the relational result of Gr(Q) are circled. 
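
To make the notion of embedding more concrete, here is a minimal Python sketch that enumerates label-preserving, root-to-root mappings from a small query tree to a small record tree and collects the values reached at the query leaves. The tuple-based tree encoding, the toy data and the function name are assumptions made for illustration; reference and identity edges are not modelled separately, since in t they are ordinary edges.

```python
from itertools import product


def embeddings(q, t):
    """Yield one tuple of leaf values per embedding of query tree q into record t.

    q is (label, [children]); t is (label, value, [children]).
    """
    q_label, q_children = q
    t_label, t_value, t_children = t
    if q_label != t_label:
        return
    if not q_children:
        # A query leaf: the variable below it maps to the value stored here.
        yield (t_value,)
        return
    # Every query child must be embedded below some child of t.
    options = [
        [e for t_child in t_children for e in embeddings(q_child, t_child)]
        for q_child in q_children
    ]
    for combo in product(*options):
        yield tuple(value for part in combo for value in part)


# Hypothetical data: a group N4 with a repeated field N5 and a single field N6.
query = ("T", [("N4", [("N5", []), ("N6", [])])])
record = ("T", None, [("N4", None, [("N5", 1, []), ("N5", 2, []), ("N6", 4, [])])])

print(list(embeddings(query, record)))
```

Each instance of the repeated field N5 gives rise to its own embedding, so the relational result of this toy query contains one tuple per N5 instance.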


It is worth noting here that a tree-expansion cannot include 
multiple paths with identical labels (see condition (2) in 
Definition 5.1); i.e., even if a single path appears multiple times in 
the query, these occurrences give a single path in the tree-expansion. 
Consequently, querying the instances of a repeated field so that these 
instances are included in different columns is not supported by 
the semantics presented in this paper. For example, we cannot ask 
the table T with schema S and instance D = {{t}}, where t and 
S are depicted in Figure 5, for pairs of N5 values (i.e., pose a 
query that returns the tuple (1,2), which corresponds to the pair of 
nodes 8 and 9 in t). Notice also that the tree-expansion yields, 
through embeddings, not only the primitive values mapped by the paths 
included in the select-clause but also the values mapped by the 
paths included in the where-clause. The latter type of values (e.g., 
the values mapped by node N9 in Example 5.5) ensures that the 
conditions included in the where-clause can be checked when the 
query is evaluated over the flattened relation (i.e., the relational 
result of the tree-expansion). Note that finding an optimized 
flattening and query evaluation algorithm, which could validate the 
where-clause conditions during flattening, is considered a topic for 
future investigation. 


5.2 Tree-expansion for multiple references 


In this section, we show how we use tree-expansions to extend the 
query semantics and navigate through multiple references, and 
cycles of references. In particular, since multiple references can 
be defined in a single schema, there might be paths in the schema 
given by navigating more than one reference. For instance, looking at 
the schema S depicted in Figure 1, we can see that there is a path P 
from the root node Booking to Location.Country through the following 
path (we omit, here, the annotations for simplicity): 


Service.Passenger_id.Id.Passenger.Location_id.Id.Location. 


The path P uses two references; hence, it is not allowed to be used 
in a query, according to what we have discussed in the previous 
sections. Even if it were allowed, flattening a query which uses 
such a path cannot be handled by the flattening approach given in 
Section 4.2. Let us now see an additional, complex example, where 
the tree-schema includes cycles of references. 


Example 5.6. Consider the table Dept with tree-schema S and 
instance D = {{t}}, where both S and t are illustrated in Figure 7. 
The table Dept stores the projects (Proj) of each department, along 
with the employees (Empl) of the department. Each project has 
a number of employees working on it, and each employee of the 
department might be accountable for (AccFor) a list of projects. 
Hence, it is easy to see that the references defined between Proj 
and Empl subtrees form a cycle of references. One could ask for the 
projects of a certain category (e.g., Catg = 'Cat2') which employ 
an employee who is accountable for a project of a different category, 
along with the name of the employee. To answer such a question, we 
need to navigate through both links. The query semantics defined 
in Section 4.2 allow only a single use of a reference between two 
subtrees. 


To handle such cases, we initially extend the paths supported 
in query expressions. In particular, considering a tree-schema S 
and the references $R_j \to I_j$ and identity constraints 
$I_j \to G_j$ over S, with j = 1,...,n, which are included in a set C, 
we generalize the reference paths to be of the following (full) form: 
$[p(R, R_1)].R_1.I_1.G_1.[p(G_1, L_1)].R_2.I_2.G_2.[p(G_2, L_2)]. \dots .R_n.I_n.G_n.[p(G_n, L_n)]$, 
where R is the root of S, p(A, B) is the de-annotated path in S from 
the node A to the node B, and $L_j$ is a descendant of $G_j$. To 
distinguish these reference paths from the paths that use a single 
reference (defined in Section 4.2), we refer to the former as 
generalized reference paths and to the latter as simple reference 
paths. Similarly to the simple reference paths, we also consider a 
shortened form of the generalized ones by omitting a prefix of the 
reachability path of $R_1$ (i.e., $p(R, R_1)$) and all the nodes 
$I_j$, $G_j$, with j = 1,...,n. 
Suppose a schema S, a set of constraints C over S, and an instance 
D of S. To compute now a query Q over S that uses generalized 
reference paths, we simply use the tree-expansion and find the 

SQL-like query language and referential constraints on tree-structured data 
Figure 6: Embedding of a tree-expansion with a single reference in each path. 

Figure 7: Schema with a cycle of references and the tree- 
expansions of the queries Q1 and Q2 defined in Example 5.6. 


embeddings from the tree-expansion over each tree-record. In 
particular, since the definition of the tree-expansion supports 
generalized reference paths, we construct the tree-expansion Gr(Q) 
of Q and evaluate Q over its relational result Gr(Q, D, C); i.e., 
Q(D) = Q(Gr(Q, D, C)). 


Example 5.7. Using the generalized reference paths, the question 
posed in Example 5.6 (for the category Catg = 'Cat2') can be 
answered by the following query Q1: 


SELECT Proj.Id as X, Proj.Name as Y, Proj.Empl.Name as Z, 
Proj.Empl.AccFor.Proj.Name as W 
FROM Dept 
WHERE Proj.Catg = 'Cat2' AND 
Proj.Empl.AccFor.Proj.Catg <> 'Cat2'; 


Note that the aliases of the columns are used as in conventional SQL. 


To construct now the tree-expansion, we initially collect the 
full form of all the paths used in the query Q1. These paths are 
listed in Table 3a. Notice that the path (#3) is a simple reference 
path that uses a single reference, while the paths (#4) and (#6) 
are generalized reference paths that go through two references. 
Combining these paths, we construct the tree-expansion Gr(Q1) of 
Q1, which is depicted in Figure 7. Computing now the embeddings of 
Gr(Q1) over t, we have the resulting relation presented in Table 3b. 


# | Full path                                  | Path type    | Clause 
1 | Dept.Proj.Id                               | Reachability | SELECT 
2 | Dept.Proj.Name                             | Reachability | SELECT 
3 | Dept.Proj.Empl.Id.Empl.Name                | Reference    | SELECT 
4 | Dept.Proj.Empl.Id.Empl.AccFor.Id.Proj.Name | Reference    | SELECT 
5 | Dept.Proj.Catg                             | Reachability | WHERE 
6 | Dept.Proj.Empl.Id.Empl.AccFor.Id.Proj.Catg | Reference    | WHERE 

(a) Paths of Q1. 
PTT Zz 


Pr3 aa 2 Pr2 
Pr3 Name 2 Pri 


(b) Result Q1(D) of Q1 over t. 


Table 3: Evaluating Q1 over t. 
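
Navigation through two references, as in the query of Example 5.7, can be illustrated with the Python sketch below. The instance is made up for the illustration and is not the tree-record t of Figure 7; the dictionary layout and the function name are likewise assumptions. Referrers are stored as lists of ids, and each id is resolved against the Id field of the referenced group.

```python
# Made-up instance (NOT the record of Figure 7): three projects, two employees.
dept = {
    "Proj": [
        {"Id": 1, "Name": "Pr1", "Catg": "Cat1", "Empl": [1]},
        {"Id": 2, "Name": "Pr2", "Catg": "Cat3", "Empl": [2]},
        {"Id": 3, "Name": "Pr3", "Catg": "Cat2", "Empl": [2]},
    ],
    "Empl": [
        {"Id": 1, "Name": "Name1", "AccFor": [3]},
        {"Id": 2, "Name": "Name2", "AccFor": [1, 2]},
    ],
}


def cat2_projects_with_accountable_staff(d):
    """Projects of category Cat2 employing someone accountable for a project
    of a different category, returned as (X, Y, Z, W) tuples."""
    proj_by_id = {p["Id"]: p for p in d["Proj"]}
    empl_by_id = {e["Id"]: e for e in d["Empl"]}
    rows = []
    for proj in d["Proj"]:
        if proj["Catg"] != "Cat2":
            continue
        for empl_id in proj["Empl"]:            # first reference: Proj.Empl -> Empl.Id
            empl = empl_by_id[empl_id]
            for proj_id in empl["AccFor"]:      # second reference: Empl.AccFor -> Proj.Id
                other = proj_by_id[proj_id]
                if other["Catg"] != "Cat2":
                    rows.append((proj["Id"], proj["Name"], empl["Name"], other["Name"]))
    return rows


for row in cat2_projects_with_accountable_staff(dept):
    print(row)
```

The same referenced project supplies both the selected name W and the category tested in the where-clause, which corresponds to the two generalized paths sharing their prefix in the tree-expansion.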


Let us now try to find the employees who are accountable for 
a project, the project they are accountable for, and the employees 
working on that project. To answer this question, we need 
to start our paths from the Empl field and go through projects, in 
order to find the projects each employee is accountable for, and 
then navigate to the employees of each such project to find the 
team members. In particular, the query Q2 answering this question 
is given as follows: 


SELECT Empl.Id as X1, Empl.Name as X2, 
Empl.AccFor.Proj.Name as X3, 
Empl.AccFor.Proj.Empl.Name as X4 
FROM Dept 
WHERE Empl.AccFor is not NULL AND 
Empl.Id <> Empl.AccFor.Proj.Empl.Id; 


To answer the query Q2, we construct the tree-expansion Gr(Q2) 
which is depicted in Figure 7. The result Q2(D) is given in Table 4. 


As we can see in the tree-expansions of Q1 and Q2 in Example 5.7, 
we start unfolding the paths from a different field (from the Proj 
field in Gr(Q1) and from the Empl field in Gr(Q2)). Typically, in 
cases where we have cycles of references, we can use a path of 
arbitrary length in the query, navigating through the paths multiple 
times. These paths, however, should be clearly specified in the 
query; i.e., we do not consider, here, any recursive operator as in 
XPath. 


Table 4: Result of Q2 over t. 


6 RELATED WORK 


Work on constraints for tree-structured data has been done during 
the past two decades. Our work, as regards the formalism, is closer 
to [5, 6, 28, 29]. The papers [6] and [5] are among the first works on 
defining constraints on tree-structured data. Reasoning about keys 
for XML is done in [6], where a single XML document is considered 
and keys within scope (relative keys) are introduced. Referential 
constraints through inclusion dependencies are also investigated 
(via path expression containment). The satisfiability problem is 
investigated, but no query language is considered. Many recent 
works investigate discovering conditional functional dependencies 
in XML Data; closer to our perspective is [28] and [29] where XML 
schema refinement is studied through redundancy detection and 
normalization. 

[24] and [4] focus on the JSON data model and an XPath-like 
navigational query language. These works also formalize the 
specification of unique fields and references, but they do not define 
relative keys. [24] formally defines a JSON schema. It supports the 
specification of unique fields within an object/element and supports 
references to another subschema (the same subschema can be used 
in several parts of the schema). No relative keys are supported. 
[4] continues on [24] and proposes a navigational query language 
over a single JSON document (this language presents XPath-like 
alternatives for JSON documents, such as JSONPath, MongoDB 
navigation expressions and JSONiq). 

Flattening has initially been studied in the context of nested 
relations and hierarchical model (e.g., [8, 23, 26]). Dremel [3, 21, 
22], F1 [27] and Drill use flattening to answer SQL-like queries 
over tree-structured data. Flattening semi-structured data is also 
investigated in [10, 11, 20], where the main problem is to translate 
semi-structured data into multiple relational tables. 


7 FUTURE WORK 


Towards future work, we want to use the tree-expansion of the query 
to distinguish cases where the output of the query can also be given 
as tree-structured data (because not all flattened data can be 
unflattened to a tree structure). Whenever this is possible, it may 
enable a shorter form of flattening, called semi-flattening, which 
can be used in conjunction with columnar storage to answer queries 
efficiently [3]. In addition, we plan to investigate querying 
tree-schemas having references to intermediate nodes. Also, we aim to 
study flattening when the referrer is defined in the range group 
of the referent. Furthermore, we plan to extend this investigation 
towards the following directions: a) Study the satisfiability and the 
implication problems for the constraints we defined here. b) The 
chase [25] is used to reason about keys and functional dependencies. 
For relational data, there is a lot of work on the chase. The chase for 
RDF and graph data was studied in [18], [13, 16], [9] and [14]. We 
plan to define a new chase that can be applied to reason about the 
constraints we defined here. 
ABSTRACT 


Recommendation methods fall into three major categories: 
content-based filtering, collaborative filtering, and deep-learning-based 
methods. Information about products and the preferences of earlier 
users are 
used in an unsupervised manner to create models which help make 
personalized recommendations to a specific new user. The more 
information we provide to these methods, the more likely it is that 
they yield better recommendations. Deep learning based methods 
are relatively recent, and are generally more robust to noise and 
missing information. This is because deep learning models can 
be trained even when some of the information records have par- 
tial information. Knowledge graphs represent the current trend in 
recording information in the form of relations between entities, and 
can provide any available information about products and users. 
This information is used to train the recommendation model. In 
this work, we present a new generic recommender systems frame- 
work, that integrates knowledge graphs into the recommendation 
pipeline. We describe its design and implementation, and then show 
through experiments, how such a framework can be specialized, 
taking the domain of movies as an example, and the resulting im- 
provements in recommendations made possible by using all the 
information obtained using knowledge graphs. Our framework, to 
be made publicly available, supports different knowledge graph 
representation formats, and facilitates format conversion, merging 
and information extraction needed for training recommendation 
models. 


CCS CONCEPTS 


• Information systems → Recommender systems; Semantic 
web description languages; • Computing methodologies → Neural 
networks; • Software and its engineering → Development 
frameworks and environments. 
1 INTRODUCTION 


Propelled by the current pandemic, we are seeing a very steep 
rise in the use of the Internet to support different human activities. 
There are more and more product/service offerings and providers 
on the Internet. While user acceptance has jumped, information 
overload has become a major problem for most Internet users. 
This is where recommender systems come into play [9]. A 
recommender system is essentially programmatic support to significantly 
trim the information of interest to a user from the massive amount 
of information available on the Internet [3]. Applications of 
recommender systems are very wide. According to reports, recommender 
systems have brought 35% of sales revenue to Amazon [1] and up 
to 75% of consumption to Netflix [2], and 60% of the browsing on 
the YouTube homepage comes from recommendation services [4]. 
Recommender systems are also widely used by various other Internet 
companies. A recommender system is useful as long as there are a 
large number of items to offer to the clients [12, 17, 18]. The current 
application domains of recommender systems have extended beyond 
e-commerce, into news, video, music, dating, health, education, etc. 

A recommender system is essentially an information filtering 
system. It "learns" the users’ interests and preferences based on 
historical behaviour of items and users, and predicts for any specific 
user the rating or preference for a given specific item, based on 
information about the item and the user. Clearly, the more 
information the recommendation method has about users and products, 
the better its ability to predict. Given the vast amount of 
unstructured, often noisy, redundant and incomplete information available 
on the Internet, finding convenient ways to gather, structure and 
provide such information is the major challenge addressed in this work. 
Specifically, we present a new generic software framework which 
enables easy integration of any available information about items 
and users, by including such information into a knowledge graph 
to be used for training the recommendation method. 

In most recommendation scenarios, items may have a lot of as- 
sociated knowledge in the form of interlinked information, and the 
network structure that depicts this knowledge is called a knowl- 
edge graph. Information in a knowledge graph is encoded in a data 
structure called triples made of subject-predicate-object relations. 
A knowledge graph greatly increases information about the item, 
strengthens the connection between items, provides a rich reference 
value for the recommendation, and can bring additional diversity 
and interpretability to the recommendation result. 
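
A minimal Python sketch of the triple representation is given below. It is an illustration only: the predicate names and the helper function are hypothetical and are not part of the framework described in this paper. It shows how shared (predicate, object) pairs connect items and, at the same time, give a readable reason for the connection.

```python
from collections import defaultdict

# A few movie facts written as (subject, predicate, object) triples.
# The predicate names are illustrative choices.
triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "has_genre", "Science Fiction"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
    ("Interstellar", "has_genre", "Science Fiction"),
    ("Heat", "directed_by", "Michael Mann"),
]


def related_items(item, triples):
    """Map every other item to the (predicate, object) pairs it shares with `item`."""
    subjects_by_fact = defaultdict(set)
    for subject, predicate, obj in triples:
        subjects_by_fact[(predicate, obj)].add(subject)
    related = defaultdict(set)
    for subject, predicate, obj in triples:
        if subject == item:
            for other in subjects_by_fact[(predicate, obj)] - {item}:
                related[other].add((predicate, obj))
    return dict(related)


print(related_items("Inception", triples))
```

The shared facts returned for each related item are what makes a graph-based recommendation explainable: the connection is justified by explicit relations rather than by an opaque score.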

In our opinion, we need a general framework, which (i) integrates 
search and update of information, (ii) enables crawling of websites 
for additional information, (iii) supports storing of the information 
in a structured, easily accessible manner, (iv) enables easy retrieval 
of the information about items and users as input for the training 
of the recommender model. Adding a knowledge graph into the 
recommendation framework can help us better manage knowledge 
data, process data, and query the information we need faster. For 
one, knowledge graphs as a form of structured human knowledge 
have drawn great research attention from both the academia and 
the industry [14, 21, 25]. A knowledge graph is a structured repre- 
sentation of facts, consisting of entities, relationships, and semantic 
descriptions. Knowledge graphs can be used wherever there is a 
relationship. They have been adopted by a large number of 
organizations, including Walmart, Google, LinkedIn, Adidas, HP, and 
the Financial Times, and applications continue to grow. 

Compared with traditional databases and information retrieval 
methods, the advantages of a knowledge graph are the following: 


• Strong ability to express relationships: Based on graph theory 
and probabilistic graph models, it can handle complex and 
diverse association analyses. 

• Knowledge learning: It can support learning functions based 
on interactive actions such as reasoning, error correction, 
and annotation; it continuously accumulates knowledge 
logic and models, and improves system intelligence. 

• High-speed feedback: The schematic data storage method 
enables fast data retrieval. 


Knowledge graphs usually have two main types of storage 
formats [10, 26]: one is RDF (Resource Description Framework) based 
storage, and the other is a graph database (e.g., Neo4j), each with 
its advantages and disadvantages. 

Secondly, recommender systems have become a relatively inde- 
pendent research direction. This is generally considered to have 
started with the GroupLens system launched by the GroupLens 
research group of the University of Minnesota in 1994 [15]. As 
a highly readable external knowledge carrier, knowledge graphs 
provide a great possibility to improve algorithm interpretation capa- 
bilities [16]. Therefore, we are using knowledge graphs to enhance 
recommendation methods in this work. 

Our main contribution is the following: 


• We present an overall architecture that allows users to build 
knowledge graphs, display knowledge graphs, and enable 
recommender algorithms to be trained with knowledge 
extracted from knowledge graphs, in a domain-agnostic manner. 
We demonstrate its application to movie recommendations. 

Other contributions include: 


• A pipeline architecture that allows users to crawl data, build 
and merge knowledge graphs in different formats, to display 
knowledge graphs and extract information, without having 
to know the underlying format details. 

• A platform for recommender systems researchers to carry 
out experiments on top of TensorFlow and Keras. 


The rest of the paper is organized as follows. We provide a brief 
review of existing recommendation software frameworks. Then, we 
describe the design and implementation of our generic framework 
in a top-down manner, along with examples of instantiation and 
specific applications. We conclude along with future extensions. 


2 RELATED WORK 


A literature scan reveals only a few attempts at developing 
frameworks for recommender systems, as most efforts have been in 
algorithms and methods. Below is a brief review of related work 
on recommendation software frameworks. We first describe these 
works, then provide a summary table of their main characteristics 
and limitations. 

Yue et al. [22] use gradient descent to learn the user’s model for 
the recommendation. Three machine learning algorithms (logistic 
regression, gradient boosting decision tree and matrix decomposition) 
are supported. Although a gradient boosting decision tree can prevent 
overfitting and has strong interpretability, it is not suitable for 
high-dimensional sparse features, which is usually the case with 
items and users. If there are many features, each regression tree will 
consume a lot of time. 

Raccoon [8] is a recommender system framework based on collaborative
filtering. The system uses k-nearest-neighbours to classify data, so
it needs to calculate the similarity of users or items. The original
implementation of Raccoon uses the well-known Pearson correlation
coefficient, which is good for measuring similarity of discrete
values in a small range. To make the calculation faster, one can
also use the Jaccard distance, which operates on binary rating data
(i.e., like/dislike). However, a collaborative filtering algorithm
does not consider the detailed features of users or products; it
only uses the user ID and product ID to make recommendations.
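To make the two similarity measures concrete, the following is a minimal sketch in plain Python. It is not Raccoon's code; the function names and the sample ratings are assumptions chosen for illustration only.

```python
from math import sqrt


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over the items that both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mean_x) ** 2 for x in xs)) * sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / norm if norm else 0.0


def jaccard_similarity(liked_a, liked_b):
    """Jaccard index on binary (liked / not liked) data, given as sets."""
    union = liked_a | liked_b
    return len(liked_a & liked_b) / len(union) if union else 0.0


# Hypothetical 1-5 star ratings for two users.
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3}
print(pearson_similarity(alice, bob))
print(jaccard_similarity({"m1", "m3"}, {"m1", "m2", "m3"}))
```

The Jaccard variant only needs set intersections and unions, which is why it is cheaper than Pearson when ratings are reduced to like/dislike.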

GER (Good Enough Recommendation) [7] is presented as a scal- 
able, easy-to-use and easy-to-integrate recommendation engine. Its 
core is the same as the knowledge graph triplet (people, actions, 
things). GER recommends in two ways. One is by comparing two 
people, looking at their history, and another one is from a person’s 
history. Its core logic is implemented in an abstraction called the 
Event Storage Manager (ESM). Data can be stored in memory ESM 
or PostgreSQL ESM. It also provides corresponding interfaces to 
the framework developer, including (i) Initialization API for operat- 
ing namespace, (ii) Events API for operating on triples, (iii) Thing 
Recommendations API for computing things, (iv) Person Recom- 
mendations API for recommending users, and (v) Compacting API 
for compressing items. 

LensKit [6] is an open-source recommender system toolkit produced by
the GroupLens Research team at the University of Minnesota. It was
originally written in Java; the Java version has been deprecated.
The new Python version of LensKit is a set of tools for
experimenting with and researching recommender systems. It provides
support for training, running and evaluating recommender systems.
The recommendation algorithms in LensKit include SVD (singular value
decomposition), Hierarchical Poisson Factorization, and KNN
(k-nearest neighbours). LensKit can work with any data in a
pandas.DataFrame (Python Data Analysis Library) with the expected
(fixed set of) columns. LensKit loads data through the dataLoader
function. Each data set class or function takes a path parameter
specifying the location of the data set. These data files have
normalized column names to fit with LensKit’s general conventions.

DKN (Deep Knowledge-Aware Network for News Recommen- 
dation) [23] proposes a model that integrates the embedded rep- 
resentation of knowledge graph entities with neural networks for 
news recommendation. News is characterized through a highly con- 
densed representation and contains many knowledge entities, and 
it is time sensitive. A good news recommendation algorithm should 
be able to make corresponding changes as users’ interests change. 
To solve the above problems, the DKN model has been proposed. 
First, a knowledge-aware convolutional neural network (KCNN) 
is used to integrate the semantic representation of news with the 
knowledge representation to form a new embedding, and then the 
attention from the user’s news click history to the candidate news 
is established. The news with higher score is recommended. 

Multi-task Feature Learning for Knowledge Graph enhanced 
Recommendation (MKR) [24] is a model that uses knowledge graph 
embedding tasks to assist recommendation tasks. These two tasks 
are linked by a cross-compression unit, which automatically shares 
potential features, and learns the high-level interaction between 
the recommender system and entities in the knowledge graph. It is
shown that the cross-compression unit has sufficient polynomial
approximation ability, and that MKR generalizes several typical
recommender system and multi-task learning methods. Our framework
incorporates a modified version of the MKR model.

Main Characteristics of Earlier Recommender Systems: 

In the above frameworks, usually just one of the three types 
of recommender system models (collaborative filtering, content- 
based and machine learning) is supported. Table 1 summarizes the 
characteristics of the above frameworks. 

As we can see there is no generic recommender system soft- 
ware framework yet that can support web crawling, information 
update, visualization and input to recommendation methods inde- 
pendent of storage formats and algorithms. We will briefly discuss 
the limitations of presently available frameworks from this view- 
point of genericity. Yue et al. mainly use the boosting model, which 
makes the training time high. Because Raccoon uses the K-nearest- 
neighbours model, it cannot handle new users or new items, well 
known as the cold start problem. Further, it also has data sparseness 
and scalability issues. The advantage of GER is that it contains no 
business rules, with limited configuration, and almost no setup 
required. But this is at the expense of the scalability of the engine. 
Other limitations are that it does not generate recommendations 
for a person with less history, and has the data set compression 
limit problem, i.e., certain items will never be used. For example, if 
items are older or belong to users with a long history, these items 
will not be used in any calculations but will occupy space. The ad- 
vantage of LensKit is that its framework contains many algorithms, 
such as funksvd and KNN, and users can call different algorithms 
according to their needs. In LensKit the data is loaded through the
path parameter, and the data has a specific format. This means that
the LensKit framework can only process user, item and rating data.
If the user has new data, such as user information or item classi- 
fication, the LensKit framework cannot handle it. Although DKN 
uses knowledge graph embedding as an aid, it is tailored to news 
recommendations and cannot be easily extended to other product 
domains. The main disadvantage of MKR is that it is not a generic 
framework, and it inputs text documents as knowledge graphs, but 
not in their graph structured form, making it cumbersome to update 
knowledge. 


3 OUR FRAMEWORK DESIGN 


Figure 1 shows the overall design of our framework. It is designed
as a pipeline of tasks from end user input to final recommendations.
Further, each stage in the pipeline is designed as a sub-framework,
and some stages are nested, enabling specialization and expansion at
a more granular level. We denote a generic component as a frozen
spot and its specialized component as a hot spot.


[Figure 1 diagram: four sub-frameworks in a pipeline. InfoExtractor
framework (MovieExtractor, other extractors); StorageManager
framework (Neo4jManager, RDFManager, SideInformationLoader,
TextInformationLoader); knowledge graph viewer framework
(networkxViewer, OpenGLViewer); recommendation system framework
(DataLoader, DataPreprocessor, MLModelBuilder, Predictor).]


Figure 1: Core components of the framework. Each box in blue is a
frozen spot of the framework and each box in red represents a set
of hot spots for each specialized framework.


The four major stages in the pipeline have the following func- 
tionality: 

- InfoExtractor framework: abstracts a web extractor. It takes
  a URL address (such as .html, .htm, .asp, .aspx, .php, .jsp,
  .jspx) and extraction rules as input, then formats the extracted
  output. This is further nested, to enable domain level (movies,
  news, etc.) specialization.

- StorageManager framework: abstracts different knowledge
  graph storage formats. It takes a string data stream as input
  and generates the output in the required format.

- Knowledge graph viewer framework: abstracts knowledge
  graph visualization. It takes the triples stream as input, then
  creates the visualization.

- Recommendation method: abstracts the recommendation
  method and knowledge input. Knowledge graph triples form
  the input for training the recommendation model.


InfoExtractor Framework Design: abstracts common meth- 
ods of information extractors. We take the example of extracting 
movie information - MovieExtractor nested in InfoExtractor. 
It serves as a frozen spot that provides users with functions for
capturing movie information, as shown in Figure 2. MovieExtractor
is the abstract class of the extractor; it has three basic methods,
extractDirectorInfo, extractWriterInfo and
Name | Type | Storage types | Domain | Example uses | Language | ML Models | Maintained
Yue Ning et al. | Decision Tree | - | content recommendation | content recommendation | - | Boosting | -
Raccoon | Collaborative Filtering | Redis | cross-domain | https://github.com/guymorita/benchmark_raccoon_movielens | Java | KNN | Jan 10, 2017
GER | Collaborative Filtering | PostgreSQL | movie | https://github.com/grahamjenson/ger/tree/master/examples | Java | - | Jul 9, 2015
LensKit | Machine Learning | LocalFile | cross-domain | https://github.com/lenskit/lkpy/tree/master/examples | Python | SVD, hpf | Nov 10, 2020
DKN | knowledge graph enhanced | LocalFile | News | https://github.com/hwwang55/DKN | Python | tensorflow | Nov 22, 2019
MKR | knowledge graph enhanced | LocalFile | cross-domain | https://github.com/hwwang55/MKR | Python | tensorflow | Nov 22, 2019
PredictionIO | machine learning | Hadoop, HBase | - | https://github.com/apache/predictionio | Scala | Apache Spark MLlib | Mar 11, 2019
Surprise | Collaborative Filtering | LocalFile | movie, joke | https://github.com/NicolasHug/Surprise/tree/master/examples | Python | matrix factorization, KNN | Aug 6, 2020


Table 1: Characteristics of Existing Frameworks


InfoExtractor
  + method: extract other information
  + method: extractDirectorInfo(movieName: string): List of string
  + method: extractWriterInfo(movieName: string): List of string
  + method: extractActorInfo(movieName: string): List of string
  + method: extractPosterInfo(movieName: string): string


Figure 2: Design of the InfoExtractor Framework.


StorageManager
  method: getAllTriples(result_list): List
  method: addNode(node name: string): node
  method: addRelation(relation name: string): relation
  method: addTriple(triple: list of string): List
  method: deleteNode(node name: string): Bool
  method: deleteRelation(relation name: string): Bool
  method: deleteTriple(triple: list of string): Bool
  method: findNode(node name: string): node
  method: findRelation(relation name: string): relation
  method: findTriple(triple: list of string): List

SideInformationLoader (uses StorageManager)
  + method: loadFile(filename: string): Bool
  + method: loadFiles(filenames: list of string): Bool
  + method: setConfiguration(storagetype: string)

RDFManager and Neo4jManager extend StorageManager.


Figure 3: Design of the knowledge graph StorageManager
framework.


extractActorInfo. They are used to extract director information,
writer information and actor information respectively.
extractPosterInfo is the abstract method of the poster downloader.
It downloads the movie poster (an example of side information)
and then converts the image into a string to facilitate storage and
encoding. Many websites store information about movies, such as
IMDB, Wikipedia, Netflix and Douban. Extractors dedicated to these
websites can be used as hot spots.
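The extractor design can be sketched as a Python abstract class with one specialization. This is an illustration under assumptions, not the framework's source: DictExtractor, its dictionary keys and the sample data are hypothetical stand-ins for a real website extractor, and we assume the poster string is produced with base64.

```python
import base64
from abc import ABC, abstractmethod
from typing import Dict, List


class MovieExtractor(ABC):
    """Frozen spot: the four extraction methods named in Figure 2."""

    @abstractmethod
    def extractDirectorInfo(self, movieName: str) -> List[str]: ...

    @abstractmethod
    def extractWriterInfo(self, movieName: str) -> List[str]: ...

    @abstractmethod
    def extractActorInfo(self, movieName: str) -> List[str]: ...

    @abstractmethod
    def extractPosterInfo(self, movieName: str) -> str: ...


class DictExtractor(MovieExtractor):
    """Hypothetical hot spot backed by a dict instead of a real website."""

    def __init__(self, catalogue: Dict[str, dict]) -> None:
        self.catalogue = catalogue

    def _names(self, movieName: str, key: str) -> List[str]:
        # Unknown movies or missing fields yield an empty list.
        return list(self.catalogue.get(movieName, {}).get(key, []))

    def extractDirectorInfo(self, movieName: str) -> List[str]:
        return self._names(movieName, "directors")

    def extractWriterInfo(self, movieName: str) -> List[str]:
        return self._names(movieName, "writers")

    def extractActorInfo(self, movieName: str) -> List[str]:
        return self._names(movieName, "actors")

    def extractPosterInfo(self, movieName: str) -> str:
        # Encode the raw image bytes as text so they can be stored as a string.
        raw = self.catalogue.get(movieName, {}).get("poster", b"")
        return base64.b64encode(raw).decode("ascii")


extractor = DictExtractor(
    {"Toy Story": {"directors": ["John Lasseter"], "poster": b"\x89PNG"}}
)
print(extractor.extractDirectorInfo("Toy Story"))
print(extractor.extractPosterInfo("Toy Story"))
```

A website-specific extractor would override the same four methods and replace the dictionary lookup with HTTP requests and HTML parsing.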

StorageManager Framework Design: StorageManager is a frozen spot,
as shown in Figure 3. Its design allows us to add support for
different kinds of storage modes easily. SideInformationLoader is
again designed as a frozen spot, and its responsibility is to add
triples to the knowledge graph through the methods in
StorageManager to increase the knowledge provided to the
recommendation method. SideInformationLoader contains three
methods: loadFile, loadFiles and setConfiguration.
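As a rough sketch of how a storage frozen spot and one hot spot could relate, consider the following. Only a subset of the Figure 3 methods is shown, the return types are simplified, and InMemoryManager and create_manager are hypothetical names introduced for illustration (the framework's own hot spots are Neo4jManager and RDFManager).

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


class StorageManager(ABC):
    """Frozen spot: storage-independent triple operations."""

    @abstractmethod
    def addTriple(self, triple: Triple) -> bool: ...

    @abstractmethod
    def deleteTriple(self, triple: Triple) -> bool: ...

    @abstractmethod
    def findTriple(self, triple: Triple) -> List[Triple]: ...

    @abstractmethod
    def getAllTriples(self) -> List[Triple]: ...


class InMemoryManager(StorageManager):
    """Hypothetical hot spot that keeps triples in a Python list."""

    def __init__(self) -> None:
        self._triples: List[Triple] = []

    def addTriple(self, triple: Triple) -> bool:
        triple = tuple(triple)
        if triple in self._triples:
            return False  # already stored, nothing added
        self._triples.append(triple)
        return True

    def deleteTriple(self, triple: Triple) -> bool:
        triple = tuple(triple)
        if triple not in self._triples:
            return False
        self._triples.remove(triple)
        return True

    def findTriple(self, triple: Triple) -> List[Triple]:
        triple = tuple(triple)
        return [stored for stored in self._triples if stored == triple]

    def getAllTriples(self) -> List[Triple]:
        return list(self._triples)


def create_manager(storagetype: str) -> StorageManager:
    """Pick a hot spot by name; new storage modes register here."""
    managers = {"memory": InMemoryManager}
    if storagetype not in managers:
        raise ValueError(f"unsupported storage type: {storagetype}")
    return managers[storagetype]()


kg = create_manager("memory")
kg.addTriple(("Toy Story", "directed_by", "John Lasseter"))
print(kg.getAllTriples())
```

Callers depend only on the abstract interface, so adding a storage mode means adding one subclass and one registry entry.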

Knowledge Graph Viewer Framework Design: This frozen spot’s
responsibility is to display triples for knowledge visualization.
As shown in Figure 4, KnowledgeGraphViewer is an abstract class,
which contains a show method.
KnowledgeGraphViewer
  method: show(kg)

OpenGLViewer (extends KnowledgeGraphViewer, uses the openGL library)
  method: show(kg)

networkxViewer (extends KnowledgeGraphViewer, uses the networkx library)
  method: show(kg)


Figure 4: Design of the knowledge graph viewer framework.


Recommendation Method Framework Design: 

We have designed a dedicated framework as shown in Figure 5.
It consists of four frozen spots: DataLoader, DataPreprocessor,
MLModelBuilder and Predictor. The DataPreprocessor module includes
three methods: preprocessKG, preprocessUserInfo and
preprocessRating. preprocessKG is used to process knowledge graph
triples. It returns three dictionaries to store the id
corresponding to the product, the id corresponding to the
relationship and the id corresponding to the person.
preprocessUserInfo is used to encode user information, including
the user’s gender, age, and occupation. preprocessRating uses the
three dictionaries stored in the previous step to convert the
product and user ID in the rating information. DataLoader reads all
the triples from the file generated in the previous step, and
returns the number of users, number of items, and number of
relations. loadKG is used to load the KG file processed in the
previous step, calculate the number of entities and the number of
relations, and return these values. These parameters are used to
create the data needed for building the prediction model. loadUsers
loads the user information file processed in the previous step and
calculates the number of users, genders, ages and jobs. These
parameters are used to create a user information matrix when
training the neural network. loadRatings loads the rating file,
calculates the number of items, and then divides the data into
training data, evaluation data and test data according to the ratio
of 6:2:2. The model is built through MLModelBuilder.
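The 6:2:2 division performed by loadRatings can be sketched as follows. The function name split_ratings, the shuffling step and the fixed seed are assumptions for illustration; the source only states the ratio.

```python
import random


def split_ratings(ratings, seed=0):
    """Shuffle the ratings and divide them 6:2:2 into train/eval/test."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding on the boundaries.
    n_train = len(shuffled) * 6 // 10
    n_eval = len(shuffled) * 2 // 10
    train_data = shuffled[:n_train]
    eval_data = shuffled[n_train:n_train + n_eval]
    test_data = shuffled[n_train + n_eval:]  # remainder goes to the test set
    return train_data, eval_data, test_data


train_data, eval_data, test_data = split_ratings(range(10))
print(len(train_data), len(eval_data), len(test_data))
```

Every rating lands in exactly one of the three sets, so the evaluation and test data stay unseen during training.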

After training, the model is used to predict the user’s rating 
for a specific product. Predictor facilitates this prediction. It in- 
cludes two methods, getUserInfo and predictScore. The user’s 
ID is used to query the user’s personal information. Three lists are 
returned, user’s gender, age, and job information. predictScore 
returns a float value, which represents the user’s rating for the 
product. 
[Figure 5 diagram: RecommendationSystem is composed of the four
interfaces listed below.]

<<interface>> DataLoader
  + storageModule: StorageManager
  + userList: <List>
  + ratingsList: <List>
  + kgList: <List>
  + loadKG(dataset): entityNumber, relationNumber, kg
  + loadUsers(dataset): userNumbers, userGenderNumbers,
    userAgeNumbers, userJobNumbers
  + loadRatings(dataset): itemNumbers, trainData, evalData, testData
  Extended by TextDataLoader (+ loadKG(path: str)) and OtherDataLoader.

<<interface>> DataPreprocessor
  + userList: <List>
  + ratingsList: <List>
  + kgList: <List>
  + preprocessKG(path: string): map(movie), map(relation), map(person)
  + preprocessRating(path: string): Bool
  + preprocessUserInfo(path: string): Bool

<<interface>> MLModelBuilder
  + userID: int
  + movieID: int
  + buildInput(int(userID), int(itemID), float(label), int(headID),
    int(tailID), int(relationID), float(dropout), int(gender),
    int(age), int(job))
  + buildLayer(int(userNumber), int(itemNumber), int(entityNumber),
    int(relationNumber), int(genderNumber), int(jobNumber),
    int(ageNumber))
  + buildLoss(userEmbedding, itemEmbedding, L2Weight)
  + buildTrain(float(RS learning rate), float(KGE learning rate))
  + eval(int(label), int(userID), int(userGender), int(userAge),
    int(userJob), int(itemID), int(label), int(headID),
    float(dropout))

<<interface>> Predictor
  + userID: int
  + userAge: int
  + userGender: string
  + userJob: string
  + movieID: int
  + getUserInfo(userID): list[userGender], list[userAge],
    list[userJob]
  + predictScore(str(userID), str(movieID)): string(predictScore)


Figure 5: Design of the recommendation method framework.


4 FRAMEWORK INSTANTIATION 


Instantiation is the process of specializing the framework to result
in an executable recommendation system for specific application
environments. This is done by providing executable modules within
frozen spots. Every frozen spot is specialized, depending on the
need. Below we discuss a few illustrative examples.


IMDBExtractor Instantiation: Earlier, we described the design
of the InfoExtractor module within the framework. When it is
instantiated for movie recommendation, we extract director, writer
and star information, movie genre, movie poster and other available
movie data as side information. In the IMDBExtractor module, we
create a list for each kind of information to be extracted. Because
a movie may have many directors, writers, and stars, the information
for each category is returned as a list. If the relevant information
cannot be found in IMDB, an empty list is returned. This is
the incomplete information case. Each triplet is stored in the
form of head, relation, tail.
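The conversion from per-category lists to (head, relation, tail) triples can be sketched as below. The function name and the relation labels ("directed_by" and so on) are hypothetical; the source does not give the actual relation names.

```python
def build_triples(movie, extracted):
    """Turn {relation: [tail, ...]} lists into (head, relation, tail) triples.

    An empty list (the incomplete information case) contributes no triples.
    """
    return [
        (movie, relation, tail)
        for relation, tails in extracted.items()
        for tail in tails
    ]


extracted = {
    "directed_by": ["John Lasseter"],
    "written_by": [],  # nothing found for this category
    "starring": ["Tom Hanks", "Tim Allen"],
}
triples = build_triples("Toy Story", extracted)
for triple in triples:
    print(triple)
```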

StorageManager Instantiation: We designed the frozen spot 
for a knowledge graph StorageManager framework in general. We 
created two storage modules as hot spots for knowledge graphs. 
They are Neo4jManager and RDFManager to accommodate the two 
different knowledge graph formats. Figure 6 shows the structure of 
this module. Figure 7 illustrates the structure of Neo4j storage. Use
of the RDFManager requires OWL operations, including: RDFSave,
RDFGetOntology, RDFAddClass, RDFAddIndividual,
RDFAddDataproperty, RDFAddDatapropertyValue and
RDFAddObjectproperty. Figure 8 illustrates the structure of RDF
storage. When the user chooses RDF storage, the framework calls the
RDFManager, uses the framework API to operate on the triples, and
then saves the result as an RDF file.
[Figure 6 diagram: the application user creates a storage manager
through createManager() of StorageManager and calls methods such as
getAllTriples() and findTriple(). Neo4jManager and RDFManager are
the available hot spots; the framework developer can add new
storage.]


Figure 6: Instantiation of the StorageManager in framework.
[Figure 7 diagram: the module reads and analyses the command line
parameter "dataset", selects the input files for the chosen option,
and saves the result to Neo4j.]

Option | Input files
all | movies.txt, ratings.txt, movie_info.txt, users_info.txt, poster.txt
user_info | movies.txt, ratings.txt, users_info.txt
poster_info | movies.txt, ratings.txt, poster.txt
movie_info | movies.txt, ratings.txt, movie_info.txt


Figure 7: Structure of the Neo4j storage module in framework.


[Figure 8 diagram: the module reads and analyses the parameter
"dataset" together with the ontology file, selects the input files
for the chosen option, and saves the result as an RDF file.]

Option | Input files
all | movies.txt, ratings.txt, movie_info.txt, users_info.txt, poster.txt
user_info | movies.txt, ratings.txt, users_info.txt
poster_info | movies.txt, ratings.txt, poster.txt
movie_info | movies.txt, ratings.txt, movie_info.txt


Figure 8: Structure of the RDF storage module in framework.
[Figure 10 diagram: the knowledge graph viewer framework takes
triples as input and draws the graph with Matplotlib as output. The
user works with networkxViewer or OpenGLViewer, both derived from
KnowledgeGraphViewer; networkxViewer provides addEdge; the developer
can add a new viewer.]


Figure 10: Instantiation of the networkxViewer framework.


[Figure 11 diagram: Recommendation System Framework. Input: triples
and hyperparameters for ML; output: trained model. DataLoader
(TextDataLoader) and DataPreprocessor (TextDataPreprocessor) provide
loadData() and dataPreprocess(). MLModelBuilder
(TensorflowMLBuilder) and Predictor (TensorflowPredictor) take
movieID and userID as input and output the predicted rating.
RecommendationSystem holds pretrainModel: pb. The framework
developer can add new models.]


[Figure 9 diagram: SideInformationLoader uses StorageManager.
TextInformationLoader and OtherInformationLoader extend
SideInformationLoader; each provides
+ method: loadFile(filename: string): Bool.
TextInformationLoader holds string(head), string(relation) and
string(tail).]


Figure 9: TextInformationLoader instantiation.


TextInformationLoader: To add additional (side) information
when available, based on the frozen spot SideInformationLoader,
we created the TextInformationLoader hot spot. The overall idea
is shown in Figure 9. We read the file given in the parameter, parse
the file according to the format, and extract the head, relation and
tail of the triples. We then add the new triples to the knowledge
graph through addTriple in StorageManager. Since there is already a
method for loading a single file, loadFiles needs only minor
adjustments: when reading multiple files, we just call
loadFile for each file.
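A minimal sketch of this loader follows. The tab-separated "head, relation, tail" line format and the ListStorage stand-in are assumptions made for illustration; the source does not specify the file format, and the real loader writes through a StorageManager hot spot.

```python
import os
import tempfile


class ListStorage:
    """Hypothetical stand-in for a StorageManager hot spot."""

    def __init__(self):
        self.triples = []

    def addTriple(self, triple):
        self.triples.append(tuple(triple))
        return True


class TextInformationLoader:
    def __init__(self, storage):
        self.storage = storage

    def loadFile(self, filename):
        """Parse one file and add each well-formed triple to the storage."""
        if not os.path.isfile(filename):
            return False
        with open(filename, encoding="utf-8") as handle:
            for line in handle:
                parts = line.rstrip("\n").split("\t")
                if len(parts) != 3:
                    continue  # skip lines that are not head/relation/tail
                self.storage.addTriple(parts)
        return True

    def loadFiles(self, filenames):
        """Call loadFile for each file; True only if every file loaded."""
        return all([self.loadFile(name) for name in filenames])


storage = ListStorage()
loader = TextInformationLoader(storage)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie_info.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Toy Story\tdirected_by\tJohn Lasseter\n")
        handle.write("malformed line\n")
    loaded = loader.loadFiles([path])
print(loaded, storage.triples)
```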

[Figure 11 diagram, continued: MLModelBuilder with
+ method: buildModel(); pretrained model;
+ method: predictScore().]


Figure 11: Workflow of recommender system framework


networkxViewer Instantiation: The workflow of the viewer
module is shown in Figure 10. Our process can be described by the
following steps:


(1) Read all the individual names and store all the individuals in 
viewerIndividuals. 

(2) Take the individual list from the previous step and get all 
the information connected to this individual. 

(3) According to the content in the individual, create nodes or 
links respectively. 
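The three steps above can be sketched in plain Python as follows. The function name and data layout are assumptions; in networkxViewer the resulting nodes and links would presumably be passed to a networkx graph and drawn with Matplotlib, which is omitted here to keep the sketch dependency-free.

```python
def build_view(triples):
    """Collect nodes and labelled links from (head, relation, tail) triples."""
    # Step 1: read all individual names.
    viewer_individuals = sorted(
        {head for head, _, _ in triples} | {tail for _, _, tail in triples}
    )
    nodes, links = set(), []
    for individual in viewer_individuals:
        # Step 2: get all the information connected to this individual.
        connected = [t for t in triples if t[0] == individual]
        # Step 3: create nodes or links according to the content.
        nodes.add(individual)
        for head, relation, tail in connected:
            nodes.add(tail)
            links.append((head, tail, relation))
    return sorted(nodes), links


triples = [
    ("Toy Story", "directed_by", "John Lasseter"),
    ("Toy Story", "genre", "Animation"),
]
nodes, links = build_view(triples)
print(nodes)
print(links)
```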


Recommendation Method Instantiation: The recommenda- 
tion method is the most important part of our framework. Here 
we present a specialization of the recommendation method frozen 
spot, using a deep learning method. We decided to implement the 
entire method ourselves, as we wanted to incorporate the benefits 
of side information obtained from using knowledge graphs. While 
it is based on the work of [11, 13], our recommendation method [19]
is written in Python and built on top of TensorFlow. Figure 11 illus- 
trates the workflow of the recommendation method module. We 
suitably modified and implemented the network architecture of the 
Figure 12: Structure of the recommendation module in RS
framework.


Figure 13: Structure of the knowledge graph embedding
module in recommender system framework.


MKR method. The structure is shown in Figure 12. The original work
does not contain user demographics; we have added user_gender,
user_age and user_job embeddings for training. To unify the
dimensions (in the deep learning network), we also choose
arg.dim as the dimension of user_age, user_job and user_gender.

For the knowledge graph embedding model, the structure is
shown in Figure 13. Let’s take our MovieLens data as an example.
There are 3883 heads and four relations in the data. Therefore, the
item matrix is a 3883 x arg.dim matrix and the relation matrix
is a 4 x arg.dim matrix. Each time, we take the vector
corresponding to the head from the head embedding according to the
head index, and then pass it through the cross and compress unit
to obtain an H_L (head) vector. The R_L (relation) vector is
obtained by looking up the vector of the relation index in the
relation matrix and then passing it through a fully connected layer.
Then we concatenate the [batch_size, dim] H_L and R_L into a
[batch_size, 2 x arg.dim] vector, and this vector is passed through
a fully connected layer to get a [batch_size, arg.dim] vector. This
vector contains the predicted tail. For the cross and compress unit,
item and head are each a vector of dimension [batch_size, arg.dim].
To facilitate calculation, we first expand them by one dimension so
that they become [batch_size, arg.dim, 1] and
[batch_size, 1, arg.dim] respectively, and then multiply them to get
the cross matrix c_matrix, a matrix of shape
[batch_size, dim, dim], and its transpose c_matrix_transpose. The
two matrices are multiplied by different weights, then reshaped
to obtain the final vectors.
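For reference, the cross and compress step described above corresponds to the unit defined in the MKR paper [24]. For a single example with item vector v and head (entity) vector e of dimension d = arg.dim, it can be written as below. The weight and bias symbols follow the MKR paper's notation and are assumptions with respect to the variable names used in our code.

```latex
\begin{aligned}
C  &= v\, e^{\top} \in \mathbb{R}^{d \times d} \\
v' &= C\, w^{VV} + C^{\top} w^{EV} + b^{V} \\
e' &= C\, w^{VE} + C^{\top} w^{EE} + b^{E}
\end{aligned}
```

Here C plays the role of c_matrix, its transpose that of c_matrix_transpose, and each weight vector w compresses the d x d cross matrix back to a d-dimensional vector.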

During training, our model is guided by three loss functions:
the L_RS loss, the L_KG loss and the L_REG loss.
The complete loss function is as follows:


$L = L_{RS} + L_{KG} + L_{REG}$ (1)


- The first term is the loss in the recommendation module.

- The second term is the loss of the KGE module, which aims
  to increase the score of the correct triplet and reduce the
  score of the wrong triplet.

- The third term is L2 regularization to prevent overfitting.


Predictor: Once the model is trained, we can make predictions
using our predictor module. There are two methods in this module,
getUserInfo and predictScore. getUserInfo is a method used
to extract the user’s information. The model trained in the previous
step is loaded into predictScore, and then different matrices are
read by name. If the id entered by the user is greater than the
dimension of the model, the id belongs to a new user, and our
recommendation focuses on the user’s age and job. Finally, a float
value (the predicted rating) is returned. These steps are described
in Algorithm 1.


Algorithm 1: Prediction application procedure


parser ← init an argument parser
dataset ← init default dataset
userid ← init default userid
movieid ← init default movieid

if trained model path exists then
    load trained model
    predict user’s rating to movie
else
    Error
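Algorithm 1 could be realized roughly as follows. This is a sketch under assumptions: the argument names, their defaults, and the load_model and predict callables are hypothetical placeholders, since the source does not show the actual command line interface or model-loading code.

```python
import argparse
import os


def build_parser():
    """Argument parser with default dataset, user id and movie id."""
    parser = argparse.ArgumentParser(description="Predict a user's rating")
    parser.add_argument("--dataset", default="movie")
    parser.add_argument("--userid", type=int, default=1)
    parser.add_argument("--movieid", type=int, default=1)
    parser.add_argument("--model_path", default="model")
    return parser


def run(argv, load_model, predict):
    """Load the trained model if it exists, then predict; otherwise fail."""
    args = build_parser().parse_args(argv)
    if not os.path.exists(args.model_path):
        raise FileNotFoundError(f"no trained model at {args.model_path}")
    model = load_model(args.model_path)
    return predict(model, args.userid, args.movieid)


# Stub callables stand in for the real model loader and predictor.
score = run(
    ["--model_path", ".", "--userid", "3", "--movieid", "7"],
    load_model=lambda path: "stub-model",
    predict=lambda model, userid, movieid: 4.0,
)
print(score)
```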


Deployment: Figure 14 shows the diagram of all the required
libraries. The purpose of each is briefly described below:


- requests is used to issue standard HTTP requests in Python.
  It abstracts the complexity behind the request so that users
  can focus on interacting with the service and using data in
  the application.

- bs4 is a library responsible for parsing, traversing, and
  maintaining the HTML “tag tree”.

- IMDB is an online database of movie information.

- base64 is a library that uses 64 characters to represent
  arbitrary binary data.

- py2neo allows Neo4j to be used from within a Python
  application and from the command line.

- owlready2 is a module for ontology-oriented programming
  in Python.

- rdflib is used to parse and serialize files in RDF, OWL, JSON
  and other formats.

- networkx is a graph theory and complex network modelling
  tool developed in the Python language, which can facilitate
  complex network data analysis and simulation modelling.

- matplotlib is a data visualization tool.

- random is used to generate random numbers.

- NumPy is a math library mainly used for array calculations.

- sklearn is an open-source Python machine learning library
  that provides a large number of tools for data mining and
  analysis.

- linecache is used to read arbitrary lines from a file.

- TensorFlow is a powerful open-source software library
  developed by the Google Brain team for deep neural networks.

5 FRAMEWORK APPLICATION 


Integrated LensKit Application: This LensKit application
compares the NDCG (Normalized Discounted Cumulative Gain)
values of LensKit recommendation algorithms. Figure 15 shows
the result of the evaluation.

Prediction of User’s Rating for a Movie: The previous sections
described the design, implementation and instantiation of our
framework. Our prediction service predicts users’ ratings of
products (as an example product we have chosen movies). The
workflow can be described as follows:


(1) Read all the triple data through dataLoader to get the cor- 
responding information. 

(2) Use these triples and the user’s rating of the movie as input, 
and train through the RS module. 

(3) Load the trained model, and predict the user’s rating of the 
movie. 


For the prediction task, we use the pipeline shown in Figure 16. 
Now, with this pipeline instance, application developers no longer 
need to worry about details. All the user needs to do is provide the 
right input. 

Predict User’s Rating for a Book: The data we use is the
Book-Crossing dataset [5], collected by Cai-Nicolas Ziegler from
the Book-Crossing community. The dataset comprises three tables:
Users, Books and Ratings. Users contains the user’s id, location
and age. Books includes the book title, book author, year of
publication, publisher and other information. Ratings includes
user ratings of the books. Because the process used is the same as
for movies, we do not repeat it here.

Knowledge Graph Fusion Application: Knowledge fusion is
an effective way to avoid node duplication in knowledge graphs.
Users may have knowledge graph files in different formats, which
may result in duplicate nodes with the same information. We use
knowledge graph fusion to solve the problem of duplicate nodes.
In our implementation, we provide an API with which users can
convert Neo4j triples to RDF, or RDF triples to Neo4j,
according to their needs. Algorithm 2 describes this procedure.
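The node de-duplication at the heart of the fusion step can be sketched as follows. We assume here that two nodes are the same when their names are equal; the function name and the integer node ids are hypothetical, and the actual conversion between Neo4j and RDF is not shown.

```python
def fuse_triples(neo4j_triples, rdf_triples):
    """Merge two triple lists, reusing a node whenever its name already exists."""
    nodes = {}   # node name -> node id
    edges = []   # (head id, relation, tail id)
    seen = set()
    for head, relation, tail in list(neo4j_triples) + list(rdf_triples):
        for name in (head, tail):
            if name not in nodes:
                nodes[name] = len(nodes)  # create a new node
        # Existing nodes are found by name; connect nodes and relations once.
        edge = (nodes[head], relation, nodes[tail])
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)
    return nodes, edges


nodes, edges = fuse_triples(
    [("Toy Story", "directed_by", "John Lasseter")],
    [
        ("Toy Story", "directed_by", "John Lasseter"),
        ("Toy Story", "genre", "Animation"),
    ],
)
print(nodes)
print(edges)
```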


6 FRAMEWORK EVALUATION 


Evaluation Testbed Specifications: Before starting the discus- 
sion about the evaluation, we first describe the environment, includ- 
ing the operating system used, processor power, memory, hardware, 
etc. The detailed specifications can be seen in Table 2. Table 3 lists 
the various libraries used in this implementation. 


IDEAS 2021: the 25th anniversary 


Sudhir P. Mudur, Serguei A. Mokhov, and Yuhao Mao 


Algorithm 2: Knowledge graph fusion application procedure

parser ← init an argument parser
mode ← init default storage mode
path ← init default path

if args.mode == "neo4j2RDF" then
    call Neo4j to RDF API
else
    call RDF to Neo4j API

Neo4j_triples_list ← init a list for all the triples in Neo4j
RDF_triples_list ← init a list for all the triples in RDF
while read file do
    add triples to Neo4j_triples_list
    add triples to RDF_triples_list
while read Neo4j_triples_list or RDF_triples_list do
    if node exists in RDF_triples_list or Neo4j_triples_list then
        find the corresponding node
    else
        create new node
    connect nodes and relations

Setting | Name | Device
Laptop | Memory | 8 GB
Laptop | Processor | 2.3 GHz Intel Core i5
Laptop | Graphic | Intel Iris Plus Graphics 640 1536 MB
Laptop | OS | Mac OS Mojave 10.14.6
Server (colab) | Memory | 12 GB
Server (colab) | Processor | Intel Core i7-920 CPU 2.67GHz
Server (colab) | Graphic | GeForce GTX1080 Ti (12 GB)
Server (colab) | OS | Ubuntu 18.04.5 LTS 64-bit


Table 2: Environment hardware specifications.


Libraries | Version
Tensorflow |
networkx |
matplotlib |
rdflib | 4.2.2
numpy | 1.17.0
sklearn | 0.21.3
pandas | 0.25.0


Table 3: Python libraries used.
[Figure 14 diagram: deployment of OpenISS RS and the libraries its
components depend on. StorageManager: py2neo, owlready2, rdflib.
Other components: requests, bs4, IMDB, base64; networkx,
matplotlib, rdflib; random, sklearn, py2neo, linecache, numpy, os,
tensorflow.]


Figure 14: UML deployment diagram.

Figure 15: LensKit evaluation result.

Figure 16: Prediction pipeline.


Because the two environments run different operating systems, the
installed software versions differ slightly, so we also give the
software details of each environment. We call the laptop
environment Setting 1 and the server environment Setting 2.
Table 4 gives more detailed information for both.


              | Setting 1 | Setting 2
Python        |           |
Neo4j Desktop |           |

Table 4: Software packages and IDE tools used.


Real-time Response: According to [20], we set the real-time
response baseline to 2000 ms. To evaluate the processing ability of
the whole system, we performed the following experiment:

(1) Load the trained model to get the pre-trained graph structure
and weights.

(2) Prepare feed_dict with new training data or test data. In this
way, the same model can be used to train or test on different data.

(3) Measure the difference between the timestamp $t_s$ before the
pipeline starts and the timestamp $t_e$ after system processing.

(4) Repeat the previous operation 100 times, and then calculate
the average processing time using Equation 2.

$$ \mathrm{Result} = \frac{\sum_{i=1}^{100} (t_e - t_s)}{100} \tag{2} $$

The result shows that the average processing time of our solution on
the local laptop is 634.33 ms, which is faster than the real-time
baseline of 2000 ms.
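The measurement procedure of steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: `pipeline` is a hypothetical zero-argument callable that stands in for loading the model and processing one request, and both function names are made up for this example.

```python
import time


def average_latency_ms(pipeline, runs=100):
    """Average wall-clock time of `pipeline()` over `runs` calls, in milliseconds."""
    total_seconds = 0.0
    for _ in range(runs):
        t_s = time.perf_counter()   # timestamp before the pipeline starts
        pipeline()                  # hypothetical callable: load model + process input
        t_e = time.perf_counter()   # timestamp after processing
        total_seconds += t_e - t_s
    return 1000.0 * total_seconds / runs


def meets_realtime_baseline(latency_ms, baseline_ms=2000.0):
    """Check a measured latency against the real-time baseline (2000 ms in the text)."""
    return latency_ms < baseline_ms
```

With the value reported above, `meets_realtime_baseline(634.33)` evaluates to true, since 634.33 ms is below the 2000 ms baseline.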

Experimental Results: We trained the MKR model using the
MovieLens-1m dataset, and then used the validation set to verify the
model. We split all data according to the ratio 6:2:2, i.e., 60% is
the training set, 20% is the validation set, and 20% is the test set.
The data of the validation set and test set were not used for
training. Our side information can be of many types, including movie
information, user information, and movie posters. We trained our
model with different combinations of side information for 20 epochs
each and obtained the results shown in Table 5.
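A 6:2:2 split of this kind might be implemented as in the sketch below. It is only an illustration: the function name `split_6_2_2`, the fixed random seed, and the use of a plain shuffled list are assumptions, and the actual experiments may have split the data differently (for example, per user).

```python
import random


def split_6_2_2(records, seed=0):
    """Shuffle `records` and split them 60/20/20 into train, validation and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n_train = len(shuffled) * 6 // 10          # 60% for training
    n_valid = len(shuffled) * 2 // 10          # 20% for validation
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]        # remaining 20% for testing
    return train, valid, test
```

Because the three slices are disjoint, no validation or test record can leak into the training set.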

From the results, we can see that the accuracy obtained with both
user information and movie information is the highest, about 1%
higher than the baseline. Because users may watch a movie because of
its director or an actor, the additional movie information helps to
improve accuracy. The age, job and other attributes in the user
information also help, because a user may choose related movies
based on age and occupation. However, because the poster of each
movie is different, each poster is connected to only one movie node
in the knowledge graph, so the poster data is sparse. Therefore, the
poster information does not currently have a good effect, but if we
can extract useful features from the posters through technologies
such as computer vision, it may help to improve the accuracy of the
recommendation.


evaluate ACC | test AUC | test ACC
baseline                     0.9189 0.8392
baseline+movie
                             0.9046 0.8277 0.9042 0.8261
                             0.9227 0.8439 0.9081 0.8295 0.9061 0.8297
                             0.9238 0.8455 0.9096 0.8321 0.9091 0.8331
baseline+user+movie
baseline+poster
                             0.9292 0.8516
                             0.9173 0.8279
baseline+movie+user+poster   0.9273 0.8497 0.9113 0.8351 0.9111 0.8349

Table 5: Results of the same models on different datasets.


7 CONCLUDING REMARKS AND FUTURE EXTENSIONS


We proposed, designed and implemented a generic software framework
that integrates knowledge graph representations to provide training
data that enhances the performance of deep learning based
recommendation methods. We demonstrated the improvements in
performance for the domain of movie recommendations to users, made
possible by the additional information that is extracted through
knowledge graphs. At the core of this framework are knowledge graphs
for storing and managing information for use by recommendation
methods. The framework is generic and can be specialized to
accommodate different product domains, recommendation algorithms,
data gathering strategies, and knowledge graph storage formats. To
the best of our knowledge, a similar framework is not available
elsewhere.

The ultimate goal of our work is to make the framework a research
platform for more developers in the recommender systems field. With
that goal, our framework needs to be extended as follows:

Java API wrapper: Our framework was written in Python, but
movie recommender systems are mostly used on web pages. So it
is better for users if we can provide a Java wrapper for our API.

Support different machine learning backends: Currently,
our recommendation module only supports TensorFlow. But there
are many other machine learning frameworks, such as PyTorch,
Caffe or Scikit-learn, and each has its own advantages. We plan to
add support for various machine learning frameworks in the future.

Support more storage methods and more input formats:
Currently, we only support four storage formats, namely RDF, RDFS,
OWL and Neo4j. For the input format, we currently use CSV, but
some users may prefer JSON or TTL, so we also need to update the
program to support these formats.
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ABSTRACT 


This article introduces the concept of Sink Group Node Betweenness
centrality to identify those nodes in a network that can "monitor"
the geodesic paths leading towards a set of subsets of nodes; it
generalizes both the traditional node betweenness centrality and
the sink betweenness centrality. We also provide extensions of
the basic concept for node-weighted networks, and describe
the dual notion of Sink Group Edge Betweenness centrality. We
exemplify the merits of these concepts and describe some areas
where they can be applied.
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1 INTRODUCTION 


The dramatic growth of online social networks during the past
fifteen years, the evolution of the Internet of Things and of the
emerging Internet of Battlefield Things, and the extensive study and
recording of large human social networks are offering an
unprecedented amount of data concerning (mainly) binary relationships
among 'actors'. The analysis of such graph-based data becomes a
challenge not only because of their sheer volume, but also because of
their complexity, which presents particularities depending on the
type of application that needs to mine these data and support
decision making. So, the field termed network science has spawned
research in a wide range of topics, for instance: a) in network
growth, developing models such as preferential attachment [4], b) in
new centrality measures for the identification of the most important
actors in a social network, developing measures such as the bridging
betweenness centrality [28], c) in epidemics/diffusion processes [18],
d) in finding community structure [21, 22], i.e., finding network
compartments which comprise sets of nodes with a high density of
internal links, whereas links between compartments have comparatively
lower density, e) in developing new types of networks different from
static and single-layer networks, such as temporal [40] and
multilayer networks [9], and of course analogous concepts for these
types of networks such as centralities [26], communities [27],
epidemic processes [5], and so on.


Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.

IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada

© 2021 Association for Computing Machinery.

ACM ISBN 978-1-4503-8991-4/21/07...$15.00
https://doi.org/10.1145/3472163.3472182


IDEAS 2021: the 25th anniversary


Dimitrios Katsaros
University of Thessaly
Volos, Greece
dkatsar@e-ce.uth.gr


Yannis Manolopoulos
Open University of Cyprus
Nicosia, Cyprus
yannis.manolopoulos@ouc.ac.cy

Despite the rich research and the really large number of concepts
developed during the past twenty years, and the diverse areas where
they have been applied, e.g., ad hoc networking [30], the field of
network science is continuously flourishing; the particular needs of
graph-data analytics create the need for diverse concepts. For
instance, let us examine a very popular and well-understood concept
in the analysis of complex networks, namely the notion of centrality
[3, 35, 42], being either graph-theoretic [23], spectral [34] or
control-theoretic [29]. Betweenness centrality [23] in particular has
been very successful; it has been used for the design of effective
attacks on network integrity, and also for discovering good
"mediators" (nodes able to monitor communication among any pair of
nodes in a network), but it is not effective in identifying
influential spreaders; for that particular problem, k-shell
decomposition [31] proved a much better alternative. However,
subsequent research [11] showed that a node's spreading capabilities
in the context of rumor spreading do not depend on its k-shell index,
whereas other concepts such as the PCI index [6] can perform better.
So, our feeling is that, in the field of network science, even some
very traditional concepts such as betweenness centrality could give
birth to very useful and practical variants.

Let us describe a commonly addressed problem in network
science, namely the malware spreading minimization problem. In
particular, we are given a set of subsets of computer nodes that
we need to protect against a spreading malware which has already
infected some nodes in the network, but we only have a limited
number of vaccines (i.e., a "budget") to use. If we had to select some
healthy (susceptible) nodes to vaccinate, then which would these
nodes be? Our decision would of course depend on the infection
spreading model, but usually routing in computer networks
implements a shortest-path algorithm. So can the traditional
shortest-path node betweenness centrality [23] help us to identify
such nodes?

Additionally, we describe a similar problem that a modern
army could possibly face due to the rapid deployment of the Internet
of Battlefield Things [44]. The army needs to destroy, by jamming,
the communications towards a set of subsets of nodes of the enemy,
but it has available only a limited number of jammers or limited time
to deploy them. The question now is which links the army should
select to attack. So can the traditional shortest-path edge
betweenness centrality [15] help us to identify such links?

The answer to the previous questions is negative, because node/edge
betweenness centrality calculates the importance of a node/edge
lying on many shortest paths connecting any pair of network
nodes, whereas in our problems we are interested in paths leading
towards specific sets of subsets of network nodes. Starting from this
observation, in this article we introduce the generic concept of
sink group betweenness centrality, which can be used as a building
block for designing algorithms to address the aforementioned problems.

The aim of the present article is to introduce the new central- 
ity concept for various types of complex networks and also to 
present some potential uses of it for solving some network science 
problems. In this context, the present article makes the following 
contributions: 


• it introduces a new centrality measure, namely the sink
group node betweenness centrality;

• it extends the basic definition for networks weighted on the
nodes;

• it extends the basic definition for the case of edge
betweenness;

• it presents basic algorithmic ideas for calculating the
aforementioned notions.


The rest of the article is structured as follows: section 2 presents
the articles which are closely related to the present work; section 3
defines the sink group betweenness centrality concept, and in
section 4 we provide a detailed example to exemplify the strengths
of the new concept. Then, in sections 5 and 6 we present the
definitions of sink group betweenness centrality for node-weighted
networks and sink group edge betweenness centrality, respectively.
Section 7 outlines some applications, and finally, section 8
concludes the article.


2 RELATED WORK 


Betweenness Centrality. 

The initial concept of (shortest path) betweenness centrality [23]
gave birth to concepts such as the proximal betweenness,
bounded-distance betweenness and distance-scaled betweenness [13],
bridging centrality [28] to help discover bridging nodes, routing
betweenness centrality [16] to account for the paths followed by
routed packets in a network, and percolation centrality [39] to help
measure the importance of nodes in terms of aiding the percolation
through the network. There are so many offspring of the initial
concept that even a detailed survey would find it practically
impossible to record each one published!

The realization that the computation of betweenness centrality
requires global topology knowledge and network-wide manipulation,
which is computationally very expensive, spawned research
into distributed algorithms for its computation [10], and inspired
variants aiming at facilitating the distributed computation of
notions similar to the original betweenness centrality, such as load
centrality [37].

Betweenness centrality and its variations have found many
applications not only in classical fields of network mining, but also
in delay-tolerant networks [38], in ad hoc networks [30], in
distributed systems, e.g., for optimal service placement [43], etc.

Approximate Betweenness Centrality.

Many applications working over modern networks require the
calculation of betweenness centrality in a nearly real-time fashion
or have to deal with a huge number of nodes and edges. So, the
decision of trading off the accuracy of betweenness centrality
computation for speed arises as a natural option. Some early
works considered approximating the exact values of betweenness
centrality [2, 24]. Later on, this became a very active research
field [10]; a survey can be found in [41].

Group Betweenness Centrality. 

Group betweenness centrality indices [19] measure the impor- 
tance of groups of nodes in networks, i.e., they measure the per- 
centage of shortest paths that pass through at least one of the nodes 
of the group. On the other hand, co-betweenness centrality [32] 
measures the percentage of shortest paths that pass through all 
vertices of the group. 

Algorithms for fast calculation of these group centrality measures
have been developed [14, 45], even for diverse types of networks
[36]; group (co-)betweenness or their variations have found
applications in monitoring [17], in network formation [8], etc.

Sink Betweenness Centrality.

The notion of Sink Betweenness centrality [46], which is a
specialization of our Sink Group Betweenness Centrality, was
developed in the context of wireless sensor networks to capture the
position of nodes which lie on many paths leading to a specific
node, i.e., the sink. So, the sink betweenness of a sensor node was
correlated to the energy consumption of that node, since it had to
relay a lot of messages towards the sink node.


3 THE SINK GROUP BETWEENNESS 
CENTRALITY 


We assume a complex network $G(V, E)$ consisting of $n$ nodes, where
$V = \{v_i, 1 \le i \le n\}$ is the set of nodes and
$E = \{(v_i, v_j),\ i, j \in V\}$ is the set of edges. We make no
particular assumptions about the network being directed or
undirected; this will be handled seamlessly by the underlying
shortest-path finding algorithm. We assume that the network is
unweighted, that is, neither the nodes nor the edges carry any
weights; however, the former assumption will be reconsidered in
section 5.

Recall that the goal of sink group betweenness centrality is to
discover which nodes lie on many paths leading towards a particular
set of designated nodes. To achieve our purpose we combine the
concepts of Group Betweenness (GBC) [20, 33] (although in a
different fashion than in the initial definition) and the concept of
Sink Betweenness (SBC) [46]. In the following we will define the
measure of Sink Group Node Betweenness Centrality (SGBC), but
first we remind the reader of some definitions.

Recall that the Shortest Path Betweenness Centrality (SPBC)
is defined as follows¹:


Definition 3.1 ((Shortest Path) Betweenness Centrality [23]). The
(Shortest Path) Betweenness Centrality (SPBC) of a node $v$ is the
fraction of shortest paths between any pair of nodes that $v$ lies on.
Equation 1 calculates the SPBC of node $v$.

$$ SPBC(v) = \sum_{i=1}^{n} \sum_{\substack{j=1 \\ i \ne j \ne v}}^{n} \frac{\sigma_{ij}(v)}{\sigma_{ij}} \tag{1} $$

where $\sigma_{ij}$ is the number of shortest paths from node $i$ to $j$, and
$\sigma_{ij}(v)$ is the number of shortest paths from $i$ to $j$ that node $v$
lies on. This is the non-normalized version of betweenness
centrality, i.e., we do not divide by the total number of node pairs.

¹ When using the term "betweenness centrality" we refer to the concept
of node betweenness centrality. When we wish to refer to the "edge
betweenness centrality" we will explicitly make use of this term.
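Equation 1 can be evaluated directly on a small unweighted network, as in the Python sketch below. This is an illustrative brute-force computation and not an algorithm from the article: the function names `bfs_counts` and `spbc` and the adjacency-dictionary input are assumptions. It relies on the standard fact that $v$ lies on a shortest $i$-$j$ path exactly when $d(i,v) + d(v,j) = d(i,j)$, in which case $\sigma_{ij}(v) = \sigma_{iv} \cdot \sigma_{vj}$.

```python
from collections import deque


def bfs_counts(adj, source):
    """Hop distances and shortest-path counts from `source` (unweighted graph)."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def spbc(adj):
    """Non-normalized SPBC of every node, summed over ordered pairs (i, j)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for i in adj:
        dist_i, sigma_i = info[i]
        for j in dist_i:                 # only nodes reachable from i
            if j == i:
                continue
            for v in adj:
                if v == i or v == j or v not in dist_i:
                    continue
                dist_v, sigma_v = info[v]
                # v is on a shortest i-j path iff the two distances add up
                if j in dist_v and dist_i[v] + dist_v[j] == dist_i[j]:
                    score[v] += sigma_i[v] * sigma_v[j] / sigma_i[j]
    return score
```

Because the double sum runs over ordered pairs, every unordered pair of an undirected network contributes twice. The sketch costs cubic time in the number of nodes, so it is only meant for checking small examples.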


Definition 3.2 (Sink Betweenness Centrality [46]). The Sink
Betweenness Centrality (SBC) of a node $v$ is the fraction of shortest
paths leading to a specific sink node $s$ that $v$ lies on. Formally, we
provide Equation 2 for calculating the SBC of node $v$.

$$ SBC(v) = \sum_{\substack{i=1 \\ i \ne s \ne v}}^{n} \frac{\sigma_{is}(v)}{\sigma_{is}}, \quad s \text{ is the sink node} \tag{2} $$

Evidently, SBC is a specialization of SPBC, i.e., $j$ does not
iterate over all nodes of the network but is kept fixed and coincides
with the sink node $s$ ($j = s$).

Suppose now that there is some "abstract grouping" process
that defines non-overlapping clusters of nodes over this network.
In principle, we do not need to apply any grouping algorithm at
all, but we can assume that some application selects the members
of each cluster as part of our input. The union of these clusters
may not comprise the whole complex network. We expect that
usually the aggregate size of all these clusters is a small fraction
of the complex network. Let us assume that we have defined $z$
non-overlapping clusters, namely $C_1, C_2, \ldots, C_z$, with
cardinalities $|C_1|, |C_2|, \ldots, |C_z|$, respectively. Then, the
Sink Group Betweenness Centrality is defined as follows:


Definition 3.3 (Sink Group Betweenness Centrality). The Sink
Group Betweenness Centrality (SGBC) of a node $v$ is the fraction
of shortest paths leading to any node which is a member of any
designated cluster that $v$ lies on. Formally, we provide Equation 3
for calculating the SGBC of node $v$.

$$ SGBC(v) = \sum_{i=1}^{n} \sum_{\substack{j \in \bigcup_{k=1}^{z} C_k \\ v \notin \bigcup_{k=1}^{z} C_k}} \frac{\sigma_{ij}(v)}{\sigma_{ij}} \tag{3} $$

where $n$ is the number of nodes, $C_k$ is the $k$-th cluster, $\sigma_{ij}$ is the
number of shortest paths from node $i$ to $j$, and $\sigma_{ij}(v)$ is the number
of shortest paths from $i$ to $j$ that node $v$ lies on.

Intuitively, a node has high Sink Group Betweenness Centrality 
if it sits in many shortest paths leading towards nodes belonging 
to any group. 
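To make Definition 3.3 concrete, the sketch below evaluates Equation 3 by brute force on a small unweighted network. It is an illustration under assumptions rather than the authors' implementation: the names `bfs_counts` and `sgbc` are hypothetical, the network is given as an adjacency dictionary, and the exclusion $i \ne j \ne v$ is carried over from Equation 1. The `generalized` flag, also an assumption of this sketch, lets cluster members receive a score as well.

```python
from collections import deque


def bfs_counts(adj, source):
    """Hop distances and shortest-path counts from `source` (unweighted graph)."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def sgbc(adj, clusters, generalized=False):
    """Sink group betweenness of candidate nodes; `clusters` is a list of node sets."""
    targets = set().union(*clusters)          # union of all designated clusters
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {}
    for v in adj:
        if v in targets and not generalized:
            continue                          # cluster members are not candidates
        dist_v, sigma_v = info[v]
        total = 0.0
        for i in adj:                         # sources range over all nodes
            dist_i, sigma_i = info[i]
            if i == v or v not in dist_i:
                continue
            for j in targets:                 # endpoints are restricted to clusters
                if j == i or j == v or j not in dist_i or j not in dist_v:
                    continue
                if dist_i[v] + dist_v[j] == dist_i[j]:
                    total += sigma_i[v] * sigma_v[j] / sigma_i[j]
        score[v] = total
    return score
```

Calling `sgbc` with a single cluster that contains one node reduces the inner loop to a fixed sink, which is the sink betweenness of Definition 3.2.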

Definition 3.3 requires that the node whose SGBC we calculate
is not part of any existing cluster. This is not mandatory in general,
but since we are looking for nodes which can act as mediators in the
communication with the clusters' nodes, it makes sense to exclude
the clusters' nodes from being considered as potential mediators.
By removing this constraint, we get the concept of Generalized
SGBC, but in this article we consider it to be equivalent to the
plain SGBC.
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We have considered unweighted and undirected complex net- 
works. The extension of the definition of Sink Group Betweenness 
Centrality for directed networks is straightforward, because it is 
handled by the path finding algorithm. On the other hand, the 
extension to node or edge weighted networks is less straightforward 
and we will provide some ideas in a later section. 


3.1 SGBC versus its closest relatives


SGBC has as its special cases the concepts of betweenness
centrality and of sink betweenness centrality. In particular, SGBC is
related to its closest relatives as follows:


• Clearly, SGBC is related to SPBC in the following
way: when the union of all clusters comprises the whole
network, then Generalized SGBC and SPBC coincide.
Moreover, SGBC is a generalization of SBC in the following
way: when we have only one cluster which contains a
single node, then SGBC and SBC coincide.

• At this point we need to make clear the difference between
Group Betweenness Centrality [20] and SGBC; the former
seeks the centrality of a group of nodes with respect to the
rest of the nodes of the network, whereas the latter seeks
the centrality of a single node with respect not to the nodes
of the whole network but to the nodes belonging to some
groups (clusters). Evidently, we can generalize Definition 3.3
and Equation 3 to follow the ideas of [20].


4 EXEMPLIFYING THE MERIT OF SGBC


Let us now look at the small complex network of Figure 1. This
network represents a collection of communicating nodes administered
by an overseeing authority. We have no weights on links, but we
have marked in red the nodes that are most important for the
authority, and thus must be protected better. We have not
designated any attacker or attacked node in the figure, because in
many situations it might not be known where the attack will initiate.


Red: Node(s) which are our target
Pink: Groups of (target) nodes
Purple: Node(s) with high Sink Betweenness Centrality
Green: Node(s) with high Sink Group Betweenness Centrality

Figure 1: Illustration of Sink Group Betweenness Centrality.

Suppose that this authority is interested in investing a limited
amount of money to buy hardware/software to equip some nodes,
hoping that these will protect the significant ones, e.g., by stopping
any cascade of infections. Let us call such nodes safeguarding
nodes. Moreover, it is obvious that the authority needs to identify a
limited number of safeguarding nodes, due to the limited budget. So,
how do we identify safeguarding nodes by exploiting the topological
characteristics of the complex network?

The obvious solution is to look at the one-hop neighbors of the
significant nodes. Then, we can make use of the SBC [46] and
say that the purple node (Y) is a safeguarding node. The purple node
sits on many shortest paths towards the red node A1 (Group-1). In
this way, we can select the set of nodes {V, W, Y} as safeguarding
nodes.

When the constraint of reducing the cardinality of this set comes
into play, then we must somehow group the significant nodes, and
seek safeguarding nodes which are close to each group or to
many groups. (This tradeoff will be analysed in the sequel.)

Concerning the structure of groups, we have the following
characteristics:


• Clusters might contain only one node.
• Nodes comprising a cluster need not be one-hop neighbors.


SBC is of no use anymore, but we can use the concept of SGBC
defined earlier. Using this concept, the green node X has high Sink
Group Betweenness Centrality because it sits on many shortest
paths towards the two red nodes A2 and A3 comprising Group-2.
Notice here that the nodes with high SGBC are not correlated to
nodes with high SPBC; for instance, the node with the highest
SPBC in the network is node Z (it is an articulation point or bridge
node).

Now let us look at the impact of the grouping of the significant
nodes on the existence (and/or SGBC value) of safeguarding nodes. As
said earlier, the grouping creates the following tradeoff: the larger
the groups we define, the fewer the nodes (if any) with high Sink
Group Betweenness Centrality we can find.

If we unite Group-1 and Group-2 into a single group and ask the 
question ‘which node(s) sit in many shortest paths towards ALL 
members of this new group’, then we can safely respond that only 
node Z is such a node, because of the particular structure of the 
network of Figure 1. Recall that node Z is a bridge node, thus all 
shortest paths from the right of Z to nodes A2 and A3 will pass 
via Z, and all shortest paths from the left of Z towards node A1 will 
pass via Z. 

If we now examine Figure 2, and consider initially a grouping
comprising two groups, i.e., Group-1 and Group-2, then we can
clearly identify the two green nodes as those with the highest
SGBC. If we create a single large group comprising all red
nodes, then we cannot find any node with high enough SGBC
to act as a safeguarding node. For this grouping, the blue nodes
also have SGBC comparable to that of the green nodes. Thus, with
this grouping the identification of safeguarding nodes becomes
problematic.


4.1 Calculation of SGBC


The calculation of SGBC can be carried out using Dijkstra's
algorithm as a basis. Algorithm 1 is a baseline one:
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Red: Node(s) which are our target 


Pink: Initial grouping of target nodes 
Green: Nodes with high SGBC with respect to the initial grouping 


Purple: New grouping (all red nodes into the same group) 


Figure 2: Impact of grouping on Sink Group Betweenness 
Centrality. 


Algorithm 1: Calculation of SGBC.


1 for each node $s \in \bigcup_{k=1}^{z} C_k$ to $j \in \{V - \bigcup_{k=1}^{z} C_k\}$ do
2     $SP_s$ = Dijkstra to find shortest paths from $s \to j$;
3 $SP = \bigcup_{\forall s} SP_s$;
4 for each path $p \in SP$ do
5     Use a hash table to group paths based on start-end;
6 for each hash table bucket do
7     Use a hash table to accumulate node appearance in paths;


PROPOSITION 4.1. The worst-case computational complexity of
Algorithm 1 is $O(|\bigcup_{k=1}^{z} C_k| \times n^2)$, where $n$ is the number of
network nodes.


PROOF. Assuming a network with $n$ nodes and $m$ edges, and an
implementation² of Dijkstra's algorithm that costs $O(n^2 + m)$, i.e.,
$O(n^2)$, then lines 1-2 cost $O(|\bigcup_{k=1}^{z} C_k| \times n^2)$; lines 3-5 cost
$O(n^2)$ since there are at most $n^2$ paths, and lines 6-7 cost $O(n^2)$. □


For unweighted and undirected networks, we can design a faster 
algorithm along the ideas of breadth-first traversal and the algo- 
rithm by Brandes [12], but this is beyond the scope of the present 
article. 
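Purely as an illustration of what such a breadth-first variant might look like, the sketch below adapts the dependency accumulation of Brandes [12] so that only endpoints inside the clusters earn credit. It is not the authors' algorithm: the function name `sgbc_bfs` is hypothetical, the network is an unweighted adjacency dictionary, `targets` is the union of all clusters, and the convention $i \ne j \ne v$ of Equation 1 is assumed. Each source needs one traversal, for a total cost of $O(n(n + m))$.

```python
from collections import deque


def sgbc_bfs(adj, targets):
    """Sink group betweenness via one BFS plus one backward sweep per source."""
    targets = set(targets)
    score = {v: 0.0 for v in adj if v not in targets}
    for s in adj:
        # Forward phase: distances, path counts and shortest-path predecessors.
        dist = {s: 0}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []                            # nodes in non-decreasing distance
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Backward phase: a node passes credit to its predecessors, but only
        # nodes that belong to a cluster generate credit in the first place.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            credit = (1.0 if w in targets else 0.0) + delta[w]
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * credit
            if w != s and w in score:
                score[w] += delta[w]
    return score
```

The only change with respect to the classical accumulation is the indicator `1.0 if w in targets else 0.0`, which replaces the constant 1 that would count every endpoint.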


² There are various implementations with even smaller cost utilizing
sophisticated data structures or taking advantage of network sparsity.


5 SGBC WITH NODE WEIGHTS


Now suppose that each node has a weight associated with it (e.g.,
depicting its trustworthiness, its balance in a transaction network,
etc.); then we need to define the SGBC in such a way that it will
take into account these weights. Note that this is different from
having weights on edges (i.e., a weighted network) because those
weights are handled by the shortest-path finding algorithm. We can
have the following options:


• We can apply the straightforward idea of multiplying by a
  node's weight as in [1], i.e.,

  $$ w(v) \times \sum_{i=1}^{n} \sum_{\substack{j \in \bigcup_{k=1}^{z} C_k \\ v \notin \bigcup_{k=1}^{z} C_k}} \frac{\sigma_{ij}(v)}{\sigma_{ij}} \tag{4} $$

  where $w(v)$ is the weight of node $v$, with appropriate
  normalization in the interval [0...1] for both the node's weight
  and SGBC.

• Apply standard procedures for turning the problem of finding
  shortest paths in networks with weights on edges and/or
  nodes into a shortest-path finding problem with weights
  only on edges. The methods are the following ($n$: number
  of nodes, $m$: number of edges):

  - We can split each node apart into two nodes as follows:
    for any node $u$, make two new nodes, $u_1$ and $u_2$. All edges
    that previously entered node $u$ now enter node $u_1$, and all
    edges that leave node $u$ now leave $u_2$. Then, put an edge
    between $u_1$ and $u_2$ whose cost is the cost of node $u$. In this
    new graph, the cost of a path from one node to another
    corresponds to the cost of the original path in the original
    graph, since all edge weights are still paid and all node
    weights are now paid using the newly-inserted edges.
    Constructing this graph can be done in time $O(m + n)$, since
    we need to change each edge exactly once and each node
    exactly once. From there, we can just use a normal
    Dijkstra's algorithm to solve the problem in time
    $O(m + n \log n)$, giving an overall complexity of
    $O(m + n \log n)$. If negative weights exist, then we can use
    the Bellman-Ford algorithm instead, giving a total complexity
    of $O(mn)$.

  - Alternatively, we can think as follows: since we have both
    edges and nodes weighted, when we move from $i$ to $j$, we
    know that the total weight to move from $i$ to $j$ is the weight
    of edge $(i \to j)$ plus the weight of $j$ itself, so let us make
    the weight of edge $(i \to j)$ the sum of these two weights and
    the weight of $j$ zero. Then we can find the shortest path from
    any node to any other node in $O(m \log n)$.

• We can multiply each fraction (of shortest paths) in the
  summation formula with the minimum weight (positive or
  negative) found along the path.
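The node-splitting transformation from the list above can be sketched in Python as follows. This is a minimal illustration under assumptions, not code from the article: edges are directed and stored in a dictionary, the two copies of a node $u$ (called $u_1$ and $u_2$ in the text) are represented by the hypothetical labels `(u, "in")` and `(u, "out")`, and all costs are non-negative so that Dijkstra's algorithm applies.

```python
import heapq
import itertools


def split_nodes(edges, node_cost):
    """Move node costs onto edges by splitting each node u into (u,'in') and (u,'out').

    edges: dict mapping a directed edge (u, v) to its cost
    node_cost: dict mapping every node to its cost
    """
    graph = {}
    for u, cost in node_cost.items():
        graph[(u, "in")] = {(u, "out"): cost}    # internal edge pays the node cost
        graph[(u, "out")] = {}
    for (u, v), cost in edges.items():
        graph[(u, "out")][(v, "in")] = cost      # original edge: leaves u_2, enters v_1
    return graph


def dijkstra(graph, source):
    """Dijkstra with a binary heap on an edge-weighted graph (non-negative costs)."""
    dist = {source: 0}
    tie = itertools.count()                      # tie-breaker so nodes are never compared
    heap = [(0, next(tie), source)]
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                             # stale heap entry
        for w, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, next(tie), w))
    return dist
```

Starting the search at `(s, "in")` and reading the distance of `(t, "out")` charges the node costs of both endpoints; starting at `(s, "out")` or reading `(t, "in")` leaves them out, so the choice depends on how path cost is defined in the application.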


6 SINK GROUP EDGE BETWEENNESS 
CENTRALITY 


The plain edge betweenness centrality measure is used to identify
the edges which lie on many shortest paths between pairs of network
nodes. It has been widely used in network science and beyond,
e.g., for discovering communities in networks [25], for designing
topology control algorithms for ad hoc networks [15], etc.

In our context, we ask whether we can identify the
edges which lie on many paths towards a set of subsets of network
nodes. The extension of SGBC to the case of edges is easy. Thus,
the Sink Group Edge Betweenness Centrality is defined as follows:


Definition 6.1 (Sink Group Edge Betweenness Centrality). The
Sink Group Edge Betweenness Centrality (SGEBC) of an edge
$e$ is the fraction of shortest paths leading to any node which is
a member of any designated cluster that $e$ lies on. Formally, we
provide Equation 5 for calculating the SGEBC of edge $e$.

$$ SGEBC(e) = \sum_{i=1}^{n} \sum_{\substack{j \in \bigcup_{k=1}^{z} C_k \\ v \notin \bigcup_{k=1}^{z} C_k}} \frac{\sigma_{ij}(e)}{\sigma_{ij}} \tag{5} $$

where $n$ is the number of nodes, $C_k$ is the $k$-th cluster, $\sigma_{ij}$ is the
number of shortest paths from node $i$ to $j$, and $\sigma_{ij}(e)$ is the number
of shortest paths from $i$ to $j$ that edge $e$ lies on.
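Definition 6.1 can be illustrated with the brute-force Python sketch below for an unweighted, undirected network. The names `bfs_counts` and `sgebc` are hypothetical, and the sketch makes two assumptions that the article does not spell out: every edge is scored, including edges incident to cluster members, and an undirected edge is credited for paths that cross it in either direction. It uses the fact that an edge $(u, w)$ lies on a shortest $i$-$j$ path in the direction $u \to w$ exactly when $d(i,u) + 1 + d(w,j) = d(i,j)$.

```python
from collections import deque


def bfs_counts(adj, source):
    """Hop distances and shortest-path counts from `source` (unweighted graph)."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def sgebc(adj, clusters):
    """Sink group edge betweenness of every undirected edge, keyed by frozenset."""
    targets = set().union(*clusters)
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {}
    for u in adj:
        for w in adj[u]:                       # visits both orientations of an edge
            edge = frozenset((u, w))
            score.setdefault(edge, 0.0)
            dist_w, sigma_w = info[w]
            for i in adj:
                dist_i, sigma_i = info[i]
                if u not in dist_i:
                    continue
                for j in targets:
                    if j == i or j not in dist_i or j not in dist_w:
                        continue
                    # the edge is traversed as u -> w on a shortest i-j path
                    if dist_i[u] + 1 + dist_w[j] == dist_i[j]:
                        score[edge] += sigma_i[u] * sigma_w[j] / sigma_i[j]
    return score
```

For a given pair $(i, j)$ at most one orientation of an edge can satisfy the distance condition, so accumulating both orientations under one key does not double count.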


7 APPLICATIONS OF SGBC


7.1 SGBC and influence minimization


We have already explained in the introduction how sink group
betweenness centrality can be used to limit the spreading in the
context of influence or infection minimization problems under
budget constraints. It becomes especially relevant for those
problems that are online [7], i.e., that require a continuous combat
against the spreading while the infection evolves in various parts
of the network.


7.2 SGBC and virtual currency networks


Let us now look again at the network of Figure 1, but this time
assume that the graph represents a network of transactions being
made using a virtual currency, e.g., a community currency [47],
which, unlike Bitcoin, is administered by an overseeing
authority. We have denoted the nodes in deficit (i.e., with a lack of
virtual money) with red color.

Suppose that this authority is interested in injecting a limited
amount of money into some nodes with the hope that these nodes
will buy something from the red nodes and therefore will reduce
their deficit. Let us call such nodes deficit balancers. Suppose that
the authority needs to identify a limited number of deficit balancers;
otherwise it will have to divide this amount of money among too many
deficit balancers, and eventually an even smaller amount of money
will end up at the nodes in deficit. So, how do we identify deficit
balancers?

Exactly as before, the obvious solution is to look at the one-hop
neighbors of the nodes in deficit. Then, we can make use of the
SBC and say that the purple node (Y) is a deficit balancer. The
purple node sits on many shortest paths towards the red node A1
(Group-1). In this way, we can select the set of nodes {V, W, Y}
as deficit balancers. If the authority is constrained to reduce the
cardinality of this set, then we must seek deficit balancers which
are close to each group or to many groups. From this point, we can
continue along the same reasoning as we did in section 4.


7.3 SGBC and community finding 


The concept of SGEBC can be used as a component of an algorithm 
for discovering the community to which a particular set of 
nodes belongs. The idea is that if the nodes of this collection are 
relatively close to each other, then repeated deletion of 
high-SGEBC edges will gradually isolate this community of nodes 
from the rest of the network nodes. 


8 CONCLUSIONS 


Network science continues to be a very fertile research and 
development area. The more our world becomes connected via 5G 
and the Internet of Things (or Everything), the more network science 
evolves into a precious tool for graph-data analysis. One of the 
central concepts in this field, namely centrality, is still a hot 
topic despite counting half a century of life, yielding new 
definitions and new insights into the networks’ organization. In this 
article, we introduced a new member into the family of shortest path 
betweenness centralities, namely sink group betweenness centrality. 
The purpose of this measure is to identify nodes which are in 
positions to monitor/control/mediate the communication towards 
subsets of network nodes. We provided a simple algorithm to calculate 
this measure, and also extended it to node-weighted networks and to 
the case of edge betweenness. This effort is simply our 
first step in a long journey to develop efficient algorithms for the 
calculation of these measures, to analyze their distribution in real 
networks, to extend them to consider sink group betweenness 
centrality computed for collections of network nodes, and to develop 
techniques for approximating them. 


ABSTRACT 


Nowadays, large-scale data-centric systems have become an essential 
element for companies to store, manipulate and derive value 
from large volumes of data. Capturing this value depends on the 
ability of these systems to manage large-scale workloads including 
complex analytical queries. One of the main characteristics of 
these queries is that they share computations in terms of selections 
and joins. Materialized views (MV) have proven effective in 
speeding up queries by exploiting these redundant computations. 
The MV selection problem (VSP) is one of the most studied problems in 
the database field. A large majority of the existing solutions follow 
workload-driven approaches since they facilitate the identification 
of shared computations. Interesting algorithms have been proposed 
and implemented in commercial DBMSs, but they fail to manage 
large-scale workloads. In this paper, we present a comprehensive 
framework to select the most beneficial materialized views based 
on the detection of the common subexpressions shared between 
queries. This framework gives its rightful place to the problem of 
selecting common subexpressions, which are the cause of 
the redundancy. The utility of the final MV depends strongly on the 
selected subexpressions. Once they are selected, a heuristic selects 
the most beneficial materialized views by considering different 
query orderings. Finally, experiments have been conducted to evaluate 
the effectiveness and efficiency of our proposal by considering 
large workloads. 

- Information systems → Data warehouses; Query optimization; 
Database views; Online analytical processing engines. 


1 INTRODUCTION 


Materialized views (MV) are widely used to speed up query processing 
thanks to their storage of query results. An MV is associated 
with redundancy: it is a redundant structure since it replicates data 
[1], and at the same time it contributes to reducing redundant query 
computations [35]. MV were first named in the 1980s in 
the context of OLTP databases [7]. The arrival of data warehouse 
technology in the 1990s established MV as the technique par excellence 
to optimize OLAP queries [12]. Recently, MV have been leveraged to 
optimize large-scale workloads involving both user jobs and 
analytical queries [14, 35]. Two main hard problems are associated with 
MV: (1) the MV selection problem (VSP) and (2) query rewriting in 
the presence of selected MV [18]. VSP consists in selecting a set 
of MV that optimizes a given workload. In [12], three alternatives 
were given to select MV: (i) materialization of the whole workload 
(costly since it requires storage budget and maintenance overhead), 
(ii) materialization of nothing, and (iii) partial materialization by 
exploiting the dependency that may exist among selected views. 
The third scenario is considered in most of the studies related to 
VSP. VSP is one of the most studied problems in the database field. 

The strong interest of the database community in the VSP calls 
for a historical analysis. The studies related to VSP have passed 
through three main periods. The first is the Golden Age [1990-2010], 
for which Google Scholar returns approximately 100 papers with 
"materialized view selection" in the title, published in major 
database conferences and journals. This golden age is characterized 
by the arrival of data warehouse technology and the crucial need to 
optimize complex OLAP queries [9]. MV have become an essential 
technique of the physical design phase. Due to this importance 
and the complexity of this selection, several research efforts from 
academia and industry have been conducted. Their findings have 
been implemented in commercial and open-source DBMSs such as 
Data Tuning Advisor for SQL Server, Design Advisor for DB2, SQL 
Access Advisor for Oracle, and Parinda for PostgreSQL [19]. 

In terms of capitalization on these research efforts, these findings 
allow the definition of a comprehensive methodology for selecting 
MVs composed of four major phases: (i) view candidate generation, 
(ii) benefit estimation models, (iii) MV selection, and (iv) 
MV validation. 


(1) The first phase consists in selecting common equivalent 
    subexpressions of a workload. Selecting them for 
    materialization purposes is a hard task. This is because it 
    depends on the join order of each query and on how 
    individual query plans are merged. To perform this selection, 
    existing studies focused on proposing graph structures such 
    as the multiple view processing plan (MVPP) [33] (and its 
    variant, the multi-view materialized graph [6]), the hypergraph 
    [8], the AND-OR view graph [25], and the data cube lattice [12], 
    as support for merging individual query plans. These structures 
    are exploited by greedy and 0-1 linear programming algorithms 
    to select the best unified query plan. These studies use the 
    simplest assumptions (optimal query plan) in obtaining the 
    individual plans and then propose algorithms to perform the 
    merging. 

(2) To estimate the benefit of materializing a common expression 
    in reducing the overall query processing cost, mathematical 
    cost models are usually needed to evaluate the utility of this 
    materialization. This utility can be seen as the difference 
    between the cost of executing the workload with this expression 
    and without it. 

(3) Since all these common expressions cannot be materialized 
    due to the MVS constraints such as storage cost, the use 
    of algorithms for selecting the appropriate ones is required. 
    The survey of [20] gives an overview of the different classes of 
    algorithms used in selecting MVs. 

(4) Once all views are selected, the original queries are then 
    rewritten based on these views [18]. 


The second period, which we call the Algorithmic Age [2010-2018], 
saw the rise of new algorithm classes 
driven by data mining [16] and game theory [3], reproducing the 
initial findings related to the other three phases. The third period, 
which we call Big Data and Machine Learning [2018-Now], saw the 
VSP benefit from different aspects of Big Data 
analytics and of machine and deep learning techniques, covering 
the availability of hardware infrastructures and new programming 
paradigms such as MapReduce and Spark. The second phase, for 
instance, has been leveraged by exploiting the MapReduce framework 
to select the best MVPP [5]. Machine learning techniques have also 
helped launch a new topic known as learned cardinality, 
which consists in estimating the intermediate results of a given query 
[35]. These techniques have also been used to identify the join order 
[34]. Recently, the requirement of optimizing large-scale workloads 
in the Big Data context has pushed the database community to 
consider the problem of subexpression selection for large-scale 
workloads. 

By analyzing the literature, we find that phase 3 received 
much more interest than the other phases, especially in the first 
two periods. One impact of the third period is that it promotes all 
phases, with a special interest in the first one, which enumerates the 
different view candidates for a large-scale workload, where query 
plans are merged in a bottom-up fashion to generate the MVPP. The 
recent work of Yuan et al. [35] pioneered the consideration of all 
phases together, which motivates us to consider the VSP with all its 
phases, with a special focus on phase 3. 

In this paper, we propose a global framework for VSP, highlighting 
all phases in the context of Big Data Warehouses characterized 
by analytical queries involving joins, aggregations and selections 
on dimension tables. We claim that the first phase of VSP is a 
precondition for having efficient materialized views. Unfortunately, it 
has been overshadowed by phase 3 in the state of the art. It should be 
noted that the selection of view candidates depends on how individual 
query plans are generated and merged. This generation depends on 
the order of the joins of each query. To perform this selection, we 
propose two classes of algorithms: (i) statistics-driven algorithms 
and (ii) an exhaustive algorithm. The first class is motivated by the 
recent advances of period 3 in terms of using learned techniques 
to estimate different cardinalities. Having such estimates 
allows us to propose two algorithms that attempt to reduce 
the size of the intermediate results of queries: in the first algorithm, 
the fact table is joined with dimension tables following their size 
(from small to large). In the second algorithm, the fact table is joined 
with dimension tables following their FAN-OUT in ascending order. 
The latter represents the average number of tuples of the fact table 
pointed to by the tuples of a dimension table satisfying the selection 
predicate defined on that table. An exhaustive algorithm called 
ZigZag generates all possible individual trees for each query in the 
workload. Once the query plans of the above algorithms are obtained, 
a merging algorithm is then applied to select the final MVPP in an 
incremental way. This merging has to increase the benefit of the 
final selected MV. Once the final MVPP is obtained, an algorithm for 
selecting appropriate views is executed. In this study, we show the 
strong dependency between all phases of VSP. 

The remainder of the paper is organized as follows. In Section 2, 
we review the most related view selection algorithms. In Section 3, 
fundamental concepts and a formalisation of our studied problem 
are given. In Section 4, we introduce our MVS Framework. Then, 
in Section 5, we detail the main modules of our materialized view 
selection framework. Section 6 details the experiments we conducted 
considering large-scale workloads. Section 7 concludes our paper 
and highlights future directions. 


2 LITERATURE REVIEW 


In this section, we present an overview of the major studies 
covering the four phases of VSP. Works developed in the Golden Age 
[11-13, 15, 17, 24, 28, 30, 32, 33, 36] have proposed various types 
of algorithms to find the best subset of materialized views from 
2^m possible subsets, where m is either the number of views (inner 
nodes) when MVPP or AND view graphs are used, or the number of 
dimensions when lattice cubes are used. [36] introduced a hierarchical 
algorithm with two levels to solve the VSP. They applied an 
evolutionary algorithm to find materialized views from the MVPP 
constructed using the heuristic algorithm proposed in [33]. In order to 
reduce the MV maintenance cost, [32] introduced a view relevance 
measure to determine how a candidate view fits into the already 
selected views and how it helps in their maintenance. [24] combined 
a simulated annealing algorithm and an iterative improvement 
algorithm into a two-phase algorithm. First, the two-phase algorithm 
iteratively improves the initial set of candidate views. The 
simulated annealing algorithm then tries to find the best MV starting 
from this initial set of candidate views. 


Bringing Common Subexpression Problem from the Dark to Light: Towards Large-Scale Workload Optimizations 


Several studies were conducted in the Algorithmic Age period 
[2-4, 8, 10, 21, 27, 31]. Based on game theory concepts, [3] developed 
a greedy algorithm. They modeled the VSP as a game with 
two opposing players: the query processing cost player and the 
view maintenance cost player. One player selects views with a high 
processing cost and the second one selects views with a high 
maintenance cost. Thus, at the end of the game, the remaining views, 
which have a lower query processing cost and a lower view maintenance 
cost, are appropriate for materialization. [27] extended the algorithm 
proposed in [3] to an evolutionary version with populations of 
players. After applying a succession of genetic operations (selection, 
mutation, and crossover) on populations of players, the extended 
algorithm finds a better solution than in the case of two players. 
Inspired by techniques used in the electronic design automation 
domain, [8] modelled the global processing plan of queries as a 
hypergraph, in which the joins shared between a set of queries are 
edges. Then, they proposed a deterministic algorithm to select the 
best set of views to materialize from the MVPP generated from the 
hypergraph. 

Regarding the Big Data and Machine Learning period, [5, 14, 22, 35] 
focused mainly on the second phase and proposed cost estimation 
models using deep learning to estimate more accurately 
the benefit of the materialized views. These cost models help in 
selecting the best materialized views that contribute to reducing 
the overall query processing cost. 

The aforementioned MV algorithms proposed in the different periods 
select the best MV from one MVPP constructed by merging only 
the optimal query trees of the queries. However, the construction of 
one MVPP from optimal query trees does not guarantee its 
optimality [26, 36]. Consequently, other MVPP constructed by merging 
other query trees can provide more beneficial MV. Moreover, 
almost all of the reviewed algorithms materialize additional views 
to decrease the maintenance cost of other materialized views. 
Although they decrease the view maintenance cost, those additional 
views increase the required storage space. 

Motivated by this fact, in this paper, we propose an MV framework 
to select the most beneficial materialized views based on the 
detection of the common subexpressions shared among queries. 
Unlike existing algorithms, our framework: (1) manages large-scale 
workloads, and (2) uses several query trees to generate several 
MVPP, looking for the best MV composed of a smaller number of 
views that optimize several queries. 


3 PROBLEM FORMULATION 


3.1 Definitions 


• A Query tree can be represented by a labeled directed 
acyclic graph (DAG) that describes the processing sequence 
of relational operations (e.g. selection, projection, join) that 
produces the result of a query (see Figure 1(a)). Leaf nodes 
are labeled by the base relations of the database schema 
required to process the query and the root node is the final 
result of the query labeled by Q, while the inner nodes are 
relational operations. 

• A multiple views processing plan (MVPP) is a labeled 
DAG, which provides a way to answer the N queries 
in the workload W against the data warehouse. Given N 
queries, an MVPP is constructed by incrementally merging 
their respective N query trees, one for each query, by sharing 
nodes defining common subexpressions among queries 
(see Figure 1(b)). Leaf nodes in the MVPP are the base 
relations and the root nodes are queries, while inner nodes are 
relational operations. 


3.2 Problem statement 


Given a Data Warehouse schema and a workload composed of star 
queries W = {Q_1, Q_2, ..., Q_N}, our aim is to generate a set of 
materialized views MV = {v_1, v_2, ..., v_z} based on common 
subexpressions, such that the processing cost of W is optimized when 
using the views in MV to answer the queries in W. 

The problem of common subexpression detection is generally 
based on the query tree merging mechanism. It is more challenging 
for a large-scale workload having: (1) a large number of queries, 
and (2) a high number of query trees per query. To illustrate the 
complexity of this problem, we present it incrementally. 

Let Q_i be a star join query from W involving a fact table and 
x_i dimension tables. The number of all possible individual trees 
corresponding to Q_i, denoted by ID_{Q_i}, is given by the following 
equation: 


ID_{Q_i} = 2 \times (x_i - 1)!    (1) 


This formula supposes that the left-deep join strategy [29] is used to 
perform this query. 

Let us assume that the queries of the workload are ordered. 
Generating the optimal MVPP for view materialization requires 
merging the individual plans. This generation has the 
following complexity: 


ordered\_nb\_merging = \prod_{i=1}^{N} ID_{Q_i}    (2) 


In the case where the queries of the workload are not ordered a priori, 
the complexity increases dramatically, as shown in the following 
equation: 


no\_ordered\_all\_merging = N! \times \prod_{i=1}^{N} ID_{Q_i}    (3) 


This high complexity requires intelligent algorithms to identify the 
most beneficial individual plan merging for view materialization. 
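
To give a feeling for how quickly Equations 1 to 3 grow, the following short Python sketch evaluates them for a tiny workload. The function names and the x values (4, 5 and 3) are chosen only for illustration.

```python
from math import factorial, prod


def individual_trees(x):
    """Equation 1: number of left-deep trees, 2 * (x - 1)!."""
    return 2 * factorial(x - 1)


def ordered_mergings(xs):
    """Equation 2: product of the per-query tree counts."""
    return prod(individual_trees(x) for x in xs)


def unordered_mergings(xs):
    """Equation 3: N! query orders times the ordered count."""
    return factorial(len(xs)) * ordered_mergings(xs)


sizes = [4, 5, 3]                              # illustrative x values, one per query
print([individual_trees(x) for x in sizes])    # [12, 48, 4]
print(ordered_mergings(sizes))                 # 2304
print(unordered_mergings(sizes))               # 13824
```

Even with three small queries, dropping the assumption of a fixed query order multiplies the search space by 3! = 6.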


4 FRAMEWORK OVERVIEW 


In this section, we propose a comprehensive framework that 
leverages the findings of the two VSP periods (Golden and Algorithmic 
Ages). Our motivation is to bring the problem of common 
subexpressions from the dark to the light. Figure 2 illustrates our 
framework. It addresses common subexpression detection and view 
materialization by considering three important correlated aspects: (i) 
the definition of individual query plans, (ii) the establishment of 
query ordering, and (iii) the identification of the most beneficial 
common subexpressions for materialization purposes. 

Figure 2: MVS Framework overview 


(1) To address the first aspect (Query Trees Generator), we 
    propose three algorithms to generate the best individual tree 
    for each query. The first two algorithms are statistics-based 
    since they use the Fan-Out and Min-Max criteria to generate 
    only one individual tree for each query. Since statistical 
    information is critical and sometimes difficult to have before 
    deploying the target data warehouse, exploratory algorithms 
    are required. From this perspective, we propose a Zig-Zag 
    algorithm that generates all possible query trees for each 
    query in W. 

(2) To address the second aspect, we adopt a heuristic 
    technique to investigate a reduced number of query orderings. 
    The heuristic examines N query orderings. Starting from an 
    initial query order, it (a) merges the corresponding query 
    trees, (b) detects the common subexpressions, (c) generates 
    their corresponding MV, and (d) estimates their benefit. This 
    process is repeated for N iterations by moving the query at the 
    first position in the current query order to the last position. 
    At each iteration, the MV Selector, in charge of selecting 
    views, identifies the best MV that maximizes the 
    benefit. 

(3) Regarding the third aspect, given a query order with the 
    query trees for each query, the challenge is to detect the 
    common subexpressions shared among queries. To address this 
    challenge, we propose an incremental query tree merging 
    algorithm that groups queries using the same subexpression 
    into a partition. Each detected subexpression is a view 
    candidate (that will be managed by the Candidate View 
    Generator). Since it is more practical to generate a smaller 
    number of materialized views that optimize the performance of 
    numerous queries in the workload, our incremental query 
    tree merging aims at finding a smaller number of common 
    subexpressions (managed by the Candidate MV Estimator). 
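
The round-robin exploration of item (2) can be sketched in a few lines of Python. This is only an illustration: the function name `round_robin_orders` and the query labels are hypothetical, and the merging, detection and estimation steps (a) to (d) are left out.

```python
from collections import deque


def round_robin_orders(queries):
    """Yield the N rotations of `queries`: each step moves the head to the tail."""
    order = deque(queries)
    for _ in range(len(order)):
        yield list(order)
        order.rotate(-1)   # the first query goes to the last position


orders = list(round_robin_orders(["Q1", "Q2", "Q3"]))
```

For N queries this examines N orderings instead of the N! possible ones.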


5 SUBEXPRESSION IDENTIFICATION FOR 
MATERIALIZATION 


In this section, we show the strong connection between common 
subexpression selection and view materialization. We then outline 
the main generic components of our framework, which are: query 
tree generator, candidate views generator, candidate MV estimator, 
and MV selector. Each component is associated with one or several 
algorithms (Figure 2). 


5.1 Query trees generator 


For each query Q_i ∈ W, we extract its base tables and 
its selection predicates. Then, we generate for each query a list of 
query trees according to the applied algorithm, namely Fan-Out, 
Min-Max, or ZigZag. The output of the query generator module is 
a list of queries with their respective lists of query trees. 
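
The three join-ordering strategies detailed in the remainder of this subsection can be sketched in Python on toy in-memory tables. Everything below is an assumption made for illustration: the function names, the row layout, and the data are hypothetical, and the ZigZag enumeration (all permutations of the dimension tables, with the two operands of the first join optionally swapped) is just one reading that is consistent with the 2 × (x − 1)! count, not necessarily the authors' procedure.

```python
from itertools import permutations


def fan_out(fact_rows, dim_rows, fk, pk, predicate):
    """Formula 4: fraction of fact rows that semi-join the filtered dimension."""
    keys = {row[pk] for row in dim_rows if predicate(row)}
    matching = sum(1 for row in fact_rows if row[fk] in keys)
    return matching / len(fact_rows)


def ordered_by(fact_name, values):
    """Fact table first, then dimensions by ascending value (Fan-Out or size)."""
    return [fact_name] + sorted(values, key=values.get)


def zigzag_orders(fact_name, dims):
    """All 2 * (x - 1)! join orders for x tables (assumed reading of the count)."""
    for perm in permutations(dims):
        yield [fact_name, *perm]                 # fact table is the first operand
        yield [perm[0], fact_name, *perm[1:]]    # operands of the first join swapped


# Toy data (hypothetical): 4 fact rows, 3 dates, 4 customers.
lineorder = [{"d": 1, "c": 10}, {"d": 1, "c": 20}, {"d": 2, "c": 20}, {"d": 3, "c": 30}]
dwdate = [{"d": 1, "year": 1995}, {"d": 2, "year": 1996}, {"d": 3, "year": 1995}]
customer = [{"c": 10, "region": "ASIA"}, {"c": 20, "region": "EUROPE"},
            {"c": 30, "region": "EUROPE"}, {"c": 40, "region": "AMERICA"}]

fan_outs = {
    "DA": fan_out(lineorder, dwdate, "d", "d", lambda r: r["year"] == 1995),
    "CU": fan_out(lineorder, customer, "c", "c", lambda r: r["region"] == "ASIA"),
}
fan_out_tree = ordered_by("LO", fan_outs)                              # Fan-Out
min_max_tree = ordered_by("LO", {"DA": len(dwdate), "CU": len(customer)})  # Min-Max
zigzag_trees = list(zigzag_orders("LO", ["DA", "CU"]))                 # ZigZag
```

On this toy data the two statistics-based strategies disagree: Fan-Out joins Customer first because its predicate is more selective, while Min-Max joins DWdate first because it is the smaller table.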


(1) Fan-Out algorithm. For each query Q_i ∈ W, and for each 
    clause cl_j of selection predicates used to filter a dimension 
    table T_j in Q_i, we compute the Fan-Out value of cl_j 
    according to Formula 4. 

    Fan\text{-}Out(cl_j) = \frac{\| FactTable \ltimes \sigma_{cl_j}(T_j) \|}{\| FactTable \|}    (4) 

    where || || and ⋉ represent respectively the number of 
    instances and the semi-join operation. 

    Then, we generate one query tree for Q_i by ordering the 
    joins of the base tables according to the ascending order of 
    the Fan-Out values. 

    Example 5.1. Let us consider the query Q shown in Figure 
    3(a). This query contains three tables: Lineorder, which plays 
    the role of the fact table (having 1000 tuples), and DWdate 
    and Customer, which play the role of dimension tables. Two 
    selection predicates are defined on these dimension tables: 
    dwdate.d_year = '1995' (cl_1) and customer.c_region = 
    'ASIA' (cl_2). In this case, we have to compute the two Fan-Out 
    values corresponding to these predicates (cl_1 and cl_2). This 
    computation can be done using the SQL statements illustrated in 
    Figures 3(b) and 3(c). Of course, this calculation requires 
    access to the database. Otherwise, the use of statistics is 
    desired, as for other parameters such as selectivity factors 
    and the size of intermediate results. 
    If Fan-Out(cl_1) < Fan-Out(cl_2), the join order of the 
    query Q is LO ⋈ DA ⋈ CU, otherwise LO ⋈ CU ⋈ DA. 
    Figures 3(d) and 3(e) show their individual trees. 

(2) Min-Max algorithm. For each query Q_i ∈ W, we generate 
    one query tree by ordering the joins of the base tables 
    according to their size, from minimum size to maximum size. 

(3) ZigZag algorithm. Contrary to the first two algorithms, it 
    does not assume any a priori join order. So for a given query 
    Q_i accessing x_i tables, ZigZag generates all 2 × (x_i - 1)! 
    query trees. 

As we can see from Figure 4, the query generator enumerates three 
query trees, one for each query, when using the Fan-Out or Min-Max 
algorithms. When the Zig-Zag algorithm is used, the query generator 
has to consider respectively for Q1, Q2, and Q3: 2 × (4-1)! = 12, 
2 × (5-1)! = 48, and 2 × (3-1)! = 4 query trees (cf. Figure 4). 

Figure 3: Query trees generator: case of Fan-Out 


5.2 Candidate views generator 


This module considers the different query trees generated by our 
above algorithms and aims at merging them in order to increase the 
sharing degree among queries. This implies query ordering. At 
the end, queries with common subexpressions are grouped in a 
partition. More concretely, this generator consists in dividing the 
initial queries into K partitions, where each one contains either 
shared queries or isolated ones. 

Let L be a list containing the trees of all queries of the workload, 
indexed by queries. This generator has three main steps: 

(1) Initialization of Partitions: To generate our partitions, we 
    start with the first two queries of our list and we try to 
    merge their query trees. Note that the merging is possible 
    if a common node exists (selection or join). If a common 
    node exists, then a partition containing these two queries is 
    created. Otherwise, a partition is created for each query. 

(2) Updating Partitions: The above step generates either 
    one or two partitions. This step has to explore the rest of the 
    queries. If the third query shares a common node with the 
    queries of an initial partition, then it is included 
    in that partition. Otherwise, a new partition is created. This 
    process is then repeated till the end of the list. 

    Example 5.2. To illustrate this step, let us consider an 
    example of three queries (Q1, Q2, Q3), where the FAN-OUT 
    algorithm is used. Recall that our list will contain three query 
    trees as shown in Figure 4. We have to check whether Q1 and 
    Q2 are mergeable. Since these two queries share a common 
    node (LO ⋈ σ_cl1(SU)), a partition P_0 is then created. This 
    partition corresponds to the first MVPP (MVPP_0). After 
    that, we have to examine the third query, which does not share 
    any node of MVPP_0. As a consequence, a new partition 
    P_1 is created containing Q3. 

(3) Candidate views generation: For each partition P_i (of the 
    above steps) with more than two queries, we generate a 
    candidate view v_i corresponding to the node used to merge 
    these queries. 

    Example 5.3. The view v1 shown in Figure 5(a) is a candidate 
    view generated from the merging of the query trees given 
    by the Query Trees Generator using the ZigZag algorithm 
    (see Figure 4). The view v1 is defined by the subexpression 
    LO ⋈ σ_cl13(CU). 

The three steps are re-executed by changing the initial order of 
the queries. The query order impacts the common subexpression 
detection. Since considering all query orders requires N! orderings, 
we propose a query ordering in a round-robin fashion (which requires 
N orderings). 

Figure 5: Candidate view generation and queries rewriting 


5.3 Candidate materialized views estimator 


In the above steps, a list of view candidates is obtained. This module 
has to quantify the benefit/utility of each candidate in reducing 
the overall query cost. To do so, the initial queries of our workload 
have to be rewritten by considering these candidates. 

An example of query rewriting is shown in Figure 5(b), where 
queries Q1, Q2, Q3 of the partition P_0 (see Figure 4, which is the 
case of the ZigZag algorithm) are rewritten using v1. 

More formally, using each view candidate v_i, we estimate the 
processing cost QPC(Q_j, v_i) for each query Q_j ∈ P_i. QPC(Q_j, v_i) 
is equal to the query processing cost of the equivalent rewritten query 
of Q_j on v_i. Finally, we compute the overall query processing cost of 
W in the presence of MV by Formula 5. 


QPC(W, MV) = \sum_{i=1;\; Q_j \in P_i,\; v_i \in MV}^{K'} QPC(Q_j, v_i) + \sum_{i=1;\; Q_j \in P_i}^{K''} QPC(Q_j)    (5) 


where K' and K'' are respectively the number of partitions with more 
than two queries sharing common subexpressions (i.e. partitions 
that group optimized queries), and the number of partitions of only 
one query (i.e. not optimized queries). Using QPC(W, MV), 
we define Formula 6 to compute the first metric, which measures the 
benefit of a set MV of candidate materialized views. 


Benefit(MV) = QPC(W) - QPC(W, MV)    (6) 


where QPC(W) represents the query processing cost of the workload 
W without any materialized views. To wisely select the best MV, 
we define a second metric that measures the quality of the set MV, 
defined by Equation 7. 


Quality(MV) = \frac{Number\_Optimized\_Queries}{Number\_Views}    (7) 


where Number_Optimized_Queries is the number of optimized 
queries resulting from the query grouping, and Number_Views 
is the number of candidate views, which is equal to K'. We recall 
here that in the computation of the QPC, we used a cost model 
based on the selection cardinality estimation and join cardinality 
estimation formulas. Table 1 summarizes the values of QPC(W), 
QPC(W, MV), and Benefit(MV) expressed in terms of 
Inputs/Outputs (I/O) units. For lack of space, other values such as the 
cost of creation of the views and their size are not shown in Table 1. 
As we can see, for 3 queries, the candidate view v1 generated from the 
merging of the Zig-Zag query trees (see Figure 4, the view with blue 
color) provides a better benefit (value with the blue color in Table 1) 
than the other views. This is mainly due to the fact that the detected 
subexpression, if materialized, will have utility for the three queries. 
But, as we can see, whatever the algorithm used to generate the 
query trees, the candidate views generator module is able to detect 
common subexpressions between the three queries, such that their 
materialization significantly optimizes the workload. 
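
The three metrics of this subsection (overall cost with views, benefit, and quality) reduce to a few sums once the per-query costs are known. The Python sketch below assumes those costs are already available from some cost model; the function names, the dictionary layout, and all numbers are hypothetical and are not taken from Table 1.

```python
def qpc_with_views(shared, isolated):
    """Overall cost of W with views: rewritten queries plus unoptimized ones.

    shared:   {view name: [cost of each query rewritten on that view]}
    isolated: [cost of each query left without a view]
    """
    return sum(sum(costs) for costs in shared.values()) + sum(isolated)


def benefit(qpc_without, qpc_with):
    """Benefit metric: cost saved thanks to the materialized views."""
    return qpc_without - qpc_with


def quality(shared):
    """Quality metric: optimized queries per candidate view."""
    optimized = sum(len(costs) for costs in shared.values())
    return optimized / len(shared)


# Hypothetical I/O costs: two views serving five queries, one isolated query.
shared = {"v1": [40, 55], "v2": [30, 20, 25]}
isolated = [100]
cost_with = qpc_with_views(shared, isolated)   # 95 + 75 + 100
gain = benefit(600, cost_with)                 # 600 is an assumed QPC(W)
ratio = quality(shared)                        # 5 optimized queries / 2 views
```

A high benefit obtained with a low quality ratio signals that many views are needed for the gain, which matters when storage is constrained.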


5.4 Materialized views selector 


Since we have view candidates for each query order, we 
need a mechanism for selecting the best views. This selection is 
performed using our cost models. 
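
How the partitioning of Section 5.2 could feed such a selector is sketched below in Python. Several simplifications are assumed and are not claimed to be the authors' implementation: each query is reduced to the set of node labels of its query tree, a query joins the first partition with which it shares a node, and a stand-in score (the number of queries placed in shared partitions) replaces the cost models. All names and labels are hypothetical.

```python
def build_partitions(order, nodes):
    """Group queries greedily: a query joins the first partition sharing a node."""
    partitions = []                          # each: {"queries": [...], "shared": set}
    for q in order:
        for part in partitions:
            common = part["shared"] & nodes[q]
            if common:                       # a common node exists: merge
                part["queries"].append(q)
                part["shared"] = common
                break
        else:                                # no partition shares a node with q
            partitions.append({"queries": [q], "shared": set(nodes[q])})
    return partitions


def optimized_queries(parts):
    """Stand-in score: number of queries placed in a shared partition."""
    return sum(len(p["queries"]) for p in parts if len(p["queries"]) > 1)


def best_partitioning(orders, nodes, score):
    """Keep the partitioning with the highest score over the given query orders."""
    return max((build_partitions(o, nodes) for o in orders), key=score)


# Node labels mimic the situation of Example 5.2: Q1 and Q2 share a join node.
nodes = {
    "Q1": {"LO-SU", "LO-SU-DA"},
    "Q2": {"LO-SU", "LO-SU-CU"},
    "Q3": {"LO-PA"},
}
orders = [["Q1", "Q2", "Q3"], ["Q2", "Q3", "Q1"], ["Q3", "Q1", "Q2"]]
best = best_partitioning(orders, nodes, optimized_queries)
```

In a fuller version, `score` would be replaced by the benefit computed from the cost model, possibly combined with the quality metric.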

In the next section, we outline the extensive experiments con- 
ducted in order to show the effectiveness of the proposed frame- 
work. 


6 EXPERIMENTAL STUDY 


To study the effectiveness of the proposed algorithms, we conducted 
intensive experiments using workloads of different sizes. All 
experiments were conducted on a machine with an Intel Core 
i5 2.9 GHz processor, 8 GB of RAM, and a 500 GB hard disk. In all 
experiments, we used the well-known Star Schema Benchmark (SSB) [23] 
as a baseline for the data warehouse schema and for the datasets. Under 
the Oracle 12c database management system, we physically created 
a data warehouse DW with the following tables: the fact table 
LINEORDER of size 102M rows (10 GB), and the dimension tables 
CUSTOMER, SUPPLIER, PART, and DATE, with respective sizes 
in rows of 3M, 200K, 1.4M, and 2566. 

To evaluate the performance and the scalability of the different 
algorithms associated with the components of our framework, we 
consider 10 workloads with the following numbers of queries: 30, 60, 
90, 130, 160, 200, 250, 300, 700, and 1000. The results of the 
experiments are discussed hereafter. Figure 6a shows the query 
processing costs without MV and with the MV selected by Fan-Out/HA, 
Min-Max/HA, and Zig-Zag/HA. Figure 6b shows the improvement rates in 
query processing cost provided by each algorithm for each workload. The 
first lesson from these experiments is that the MV selected by our 
algorithms significantly improve the overall query processing. This 
confirms the main role of MV in speeding up OLAP queries. 

The second interesting lesson is related to the correlation 
between the size of the workload and query performance. As the 
size increases, the improvement rates of the query processing also 
increase (Figure 6b). This shows the strong connection between the 
common subexpression and view selection problems. When the 
number of queries increases, the probability of sharing becomes 
higher, which increases the number of view candidates. 

For instance, for the workload with 1000 queries, Fan-Out/HA, 
Min-Max/HA, and Zig-Zag/HA provide improvement rates of 88.97%, 
78.39%, and 83.15%, respectively, in the query processing cost. 
These algorithms detect 216, 161, and 143 common subexpressions, 
respectively. The materialization of these views optimizes 901, 
813, and 856 queries, respectively, out of the 1000 queries, which 
represents a high performance. 

Based on the above result, we remark that Zig-Zag/HA can detect 
the most beneficial subexpressions for materialization. This is 
due to its exploratory search, and it is confirmed by the 
improvement rates shown in Figure 6b. From 30 queries up to 300 
queries, Zig-Zag/HA outperforms Fan-Out/HA and Min-Max/HA. However, 
when the workload increases up to 1000 queries, the Fan-Out/HA 
algorithm performs better than Min-Max/HA and Zig-Zag/HA. 
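
One simple way to read the figures reported for the 1000-query workload is to divide the number of optimized queries by the number of detected common subexpressions. The sketch below does this arithmetic with the values quoted above; the ratio is an illustrative indicator chosen here and is not the quality metric of Formula 7.

```python
# Figures reported in the text for the workload with 1000 queries.
RESULTS = {
    "Fan-Out/HA": {"subexpressions": 216, "optimized": 901, "improvement": 88.97},
    "Min-Max/HA": {"subexpressions": 161, "optimized": 813, "improvement": 78.39},
    "Zig-Zag/HA": {"subexpressions": 143, "optimized": 856, "improvement": 83.15},
}


def queries_per_view(name):
    """Optimized queries per detected subexpression (illustrative ratio only)."""
    row = RESULTS[name]
    return row["optimized"] / row["subexpressions"]


def rank_by_queries_per_view():
    """Algorithm names ordered from the highest ratio to the lowest."""
    return sorted(RESULTS, key=queries_per_view, reverse=True)


if __name__ == "__main__":
    for name in rank_by_queries_per_view():
        print(f"{name}: {queries_per_view(name):.2f} optimized queries per view")
```

Under this reading, Zig-Zag/HA optimizes roughly six queries per materialized subexpression, against about five for Min-Max/HA and about four for Fan-Out/HA, which is consistent with the remark that Zig-Zag/HA finds the most beneficial subexpressions even though Fan-Out/HA has the best overall improvement rate at this workload size.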

These results were predictable from the complexity of the merging 
process presented in Section 3.2. Zig-Zag/HA needs a long runtime 
to explore a large number of query tree mergings when looking for 
the most beneficial subexpressions. 

If we examine the three algorithms in terms of the quality metric 
computed by Formula 7, whose values are plotted in Figure 7a, we 
can easily conclude that Min-Max/HA and Zig-Zag/HA are more 
effective than Fan-Out/HA. 

We can also conclude that, for a moderate workload, Min-Max/HA is 
better than Zig-Zag/HA in terms of effectiveness. For a large 
workload, however, Zig-Zag/HA provides a good compromise between 
the number of optimized queries and the number of selected 
materialized views. Finally, if the materialized view selection is 
constrained by a storage space limit, the Fan-Out/HA algorithm is 
the best one. As shown in Figure 7b, whatever the size of the 
workload, Fan-Out/HA is able to select a set of materialized views 
that improves the query processing cost and requires less storage 
space. 

To summarize our results, we can conclude that all three 
algorithms are interesting alternatives for VSP. They can be applied 
according to the requirements of the decision-support application 
and to the structure of the workload. They can also be combined by 
dividing the workload queries into classes based on their 
importance and executing a specific algorithm for each class. 
Min-Max/HA and Fan-Out/HA certainly offer interesting performance, 
but it must be said that they are highly dependent on statistics, 
whereas Zig-Zag/HA represents an interesting alternative for 
moderate workloads. 


Table 1: Cost estimation in number of I/O 

Figure 6: (a) Query processing cost, (b) Query processing improvement rate 

Figure 7: (a) Optimized queries on materialized views, (b) Materialized views size 


7 CONCLUSION 


Nowadays, decision-support applications tend to use large-scale 
workloads executing on data warehouses with various tables. This 
situation pushes researchers to revisit the existing findings to 
optimize such workloads. Among these findings, we can cite the case 
of materialized views. These techniques, if well used, can 
contribute to reducing the redundant computations detected in the 
queries of cloud user jobs. The problem of selecting these views is 
a hard problem. It has been widely studied in traditional data 
warehouses, where existing studies consider workloads of a few tens 
of queries. Recently, it has resurfaced. This situation led us to 
elaborate a historical overview of the materialized view selection 
problem, in which three main periods have been identified: Golden 
Age [1990-2010], Algorithmic Age [2010-2017] and Big Data and 
Machine Learning [2018-Now]. We believe that this historical 
presentation will help young researchers in understanding this 
crucial problem. In addition to this historical review, we 
succeeded in showing the strong connection between view 
materialization and common query subexpressions. In the literature, 
the problem of common subexpression selection has received less 
attention than other surrounding problems of VSP. Therefore, we 
attempted to bring it from the dark into the light. This was done 
through an incremental analysis of its complexity and the 
presentation of a framework that combines these two problems. This 
framework contains four components: (1) query tree generator, 
(2) candidate views generator, (3) candidate MV estimator, and 
(4) MV selector. Two classes of algorithms are given for generating 
individual query plans: two statistics-based algorithms called 
Min-Max and Fan-Out, and an exploratory algorithm called Zig-Zag. 
An algorithm for merging these plans is also given, accompanied by 
a materialized view selection algorithm. These algorithms aim at 
increasing the utility of the final views in terms of the reduction 
of the overall query processing cost. Several experiments were 
conducted to evaluate our proposal. The obtained results show the 
complementarity of our algorithms. 

Currently, we are developing a wizard supporting our framework and 
studying the possibility of combining these algorithms for 
large-scale workloads by partitioning them. Another issue concerns 
the consideration of new alternatives for query ordering. 
ABSTRACT 


The internet was introduced to connect computers and allow 
communication between these computers. It evolved to provide 
applications such as email, talk and file sharing, with an 
associated system to search. The files were made available, 
freely, by users. However, the internet was out of the reach 
of most people since it required equipment and know-how as 
well as a connection to a computer on the internet. One method 
of connection used an acoustic coupler and an analog phone. 
With the introduction of the personal computer and higher 
speed modems, accessing the internet became easier. The 
introduction of user-friendly graphical interfaces, as well as 
the convenience and portability of laptops and smartphones, 
made the internet much more widely accessible for a broad 
swath of users. A small number of newly established companies, 
supported by a large amount of venture capital and a 
lack of regulation, have since established a stranglehold on 
the internet, with billions of people using their applications. 
Their monopolistic practices and exploitation of the open 
nature of the internet have created a need in the ordinary 
person to replace the traditional way of communication with 
what they provide: in exchange for giving up personal 
information, these persons have become dependent on the service 
provided. Due to the regulatory desert around privacy and 
ownership of personal electronic data, a handful of massive 
corporations have expropriated and exploited aggregated and 
disaggregated personal information. This amounts, we argue, 
to the colonization of the internet. 


*Corresponding Author 


Permission to make digital or hard copies of all or part of this work 
for personal or classroom use is granted without fee provided that 
copies are not made or distributed for profit or commercial advantage 
and that copies bear this notice and the full citation on the first 
page. Copyrights for components of this work owned by others than 
ACM must be honored. Abstracting with credit is permitted. To copy 
otherwise, or republish, to post on servers or to redistribute to lists, 
requires prior specific permission and/or a fee. Request permissions 
from permissions@acm.org. 

IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 

© 2021 Association for Computing Machinery. 

ACM ISBN 978-1-4503-8991-4/21/07... $15.00 

https://doi.org/10.1145/3472163.3472179 


IDEAS 2021: the 25th anniversary 
1 INTRODUCTION 


The internet was introduced to connect computers and allow 
communication between these computers. With the introduction 
of portable personal computers and higher speed modems, access 
to the internet became easier. The introduction of X-windows, 
a graphical interface[107], and of hardware incorporating such 
graphical interfaces in closed systems brought in more users, 
including a cult of users of one brand who have remained 
attached to this closed system and its new models! The world 
wide web, developed in the early 1990s on a machine 
incorporating a version of this graphical interface, was 
followed by a graphical browser along with laptops and 
smartphones, making it possible for a large number of users to 
connect to the internet. Private capital, with the tacit support 
of the USAian government, was able to nourish the emergence of 
big tech: private capital was on-side since there were no 
regulations and no application of the existing regulations and 
laws to the internet. This free-for-all meant that great 
fortunes could be reaped and existing boundaries of acceptable 
practice ignored under the ‘platform’ designation of big-tech. 
The other problem was of course the lack of imagination of the 
complacent management of existing corporations, which failed 
to provide the additional services. Instead of preventing 
monopolies in the new digital world, legislators promoted them 
to foster innovation at the expense of privacy and security. 
This was what prompted capital to be made available to the 
emerging robber barons of the late 20th century. These 
corporations, headed by buccaneers, started putting down their 
own rules and bought politicians. Their big purses allowed them 
to bend most politicians, and anyone with independent thought 
and ideas was put down by the anti-populist forces[43]. 

These newly established companies were supported by a large 
amount of venture capital and by the lack of regulations, 
and/or by regulators not knowing how to apply the existing 
regulations in a new digital world where services were ‘free’. 
These new companies in the tech field have since established a 
stranglehold on the internet and essentially colonized it by 
expropriating the personal data of the users of the free 
services and mining this data to exploit and influence them in 
subtle ways. They have exploited the opportunity and created a 
need in the ordinary person to replace the traditional way of 
communication with what they provide: these persons have become 
dependent on the service provided in exchange for giving up 
personal information. The internet, mobile phone technology, 
and the web have been exploited by new companies since the 
established players were restricted by legislation or, mostly, 
by inertia and lack of foresight. For example, the national 
postal services should have been called in to provide email 
service to supplement traditional mail service. The lack of 
politicians with any foresight, savvy and/or political will, and 
the resistance to providing funds to existing systems such 
as the postal service to build up expertise and infrastructure 
in this new domain, meant that this did not occur anywhere. 
Some of these new tech companies, extending and scaling 
their infrastructure, have set up cloud services (time sharing 
with less control). Such cloud services are tempting businesses 
to move their computing to such clouds and abandon their 
existing infrastructure. Examples are the migration of 
operations such as organizational email systems, administrative 
services and so on to the cloud. The result is not necessarily 
an improvement, nor economical. One example of a colossal 
fiasco was the migration of a payroll software system by the 
government of Canada to an untested system called Phoenix: 
“As a result of world-class project mismanagement on the 
Phoenix project, the Canadian government now owns and 
operates a payroll system ‘that so far has been less efficient 
and more costly than the 40-year-old system it replaced.’” [82]. 

As in the past, the introduction of new technology has upset 
the status quo. Opportunities have been missed by established 
players. Industries such as the taxi industry have also suffered 
with the introduction of instant communication and location 
broadcasting mobile phones. New players, falsely claiming 
to be ride-sharing, have carved out a large chunk of the taxi 
business. Companies and individual operators, who paid a 
high price for a taxi permit, were left holding the bag. New 
players, breaking and challenging established regulations and 
using communication technology along with willing drivers 
with automobiles, were able to offer an alternate system and 
take over a sizable chunk of the taxi business everywhere. 
Again, the existing taxi system, with its regulation, did not 
see an opportunity to use the new technology. It was also 
restricted from charging different amounts for the same ride, 
unlike the so-called ride-sharing system, which could charge a 
rate depending on the demand, the time of day, etc., since 
there was no regulation for the ride-sharing system. 


2 COLONIZATION 


Throughout human existence, tribes have moved from one 
pasture to another. In pre-historic days, it is likely that there 
were no other humans in the new pasture; if there were any, 
either the existing population would be annihilated or absorbed 
by the new herd, or the new herd would be assimilated into it 
or driven out. The new worlds were invaded by hordes 
from the European countries. Having better weapons and 
using the divide and conquer strategy time and again, these 
invaders (settlers) to the new world were able to overpower 
the existing population. Unfortunately, the pre-historic 
techniques are still being used to date in some parts of the 
world[42, 109]. The practice of annihilation, dispossession 
and driving out was gradually replaced by the strategy of 
forceful conversion[71-73]: 

“For over a century, the central goals of Canada’s Aborigi- 
nal policy were to eliminate Aboriginal governments; ignore 
Aboriginal rights; terminate the Treaties; and, through a pro- 
cess of assimilation, cause Aboriginal peoples to cease to exist 
as distinct legal, social, cultural, religious, and racial entities 
in Canada. The establishment and operation of residential 
schools were a central element of this policy, which can best 
be described as “cultural genocide.” ... Cultural genocide is 
the destruction of those structures and practices that allow 
the group to continue as a group. 

States that engage in cultural genocide set out to destroy 
the political and social institutions of the targeted group. 
Land is seized, and populations are forcibly transferred and 
their movement is restricted. Languages are banned. Spiritual 
leaders are persecuted, spiritual practices are forbidden, and 
objects of spiritual value are confiscated and destroyed. ... 

In its dealing with Aboriginal people, Canada did all these 
things.” [39, 71] 

The practice of assimilation was carried out by successive 
governments[74], including those under a prime minister who 
was awarded the peace prize[62] but did nothing to solve 
the problems at home! The shameful practice of abducting 
children continued well to the end of the 20th century[54, 72]. 
The same strategy was used in the USA, as reported in a 
recent article[37]. 

Similar practices were present throughout the so-called new 
world, which included North and South America and Australia, 
and to some extent Africa. However, the topic of this paper being 
the colonization of the Internet, we will not dwell on this any 
further. 


2.1 Trading Colonization 


The invasion of the new world and its colonization was 
followed by another type of colonization that started with 
trading by a number of East India Companies with various 
European bases[96-102]. They were established to trade in 
spices and other resources from the Indian subcontinent and 
the orient. These trading companies, requiring the protection 
of their territory, initially used the company’s hired armed 
men[103], which was followed by the armed forces of the 
company’s home country[104]. Since the ‘invaded’ country had 
a civilization older than the colonizing one and had a large 
population, these trading nations were not able to annihilate 
the existing population. However, just as in the new worlds, 
using a divide and conquer strategy, the existing system of 
governments was replaced by governments put in place 
by the colonizing country, and attempts to discourage the 
existing culture and way of life were the norm. 


3 THE INTERNET 


The colonization we are focusing on in this article is the 
colonization of what was supposed to be a ‘free’ internet. 
Using the philosophy of a free internet, a handful of big techs 
have not only taken over the internet, they have created their 
own systems to expropriate and exploit the private information 
of people everywhere. “Free” web applications are offered in 
exchange for recording every action of the users in order to 
influence them, including creating a need for products and 
services of doubtful use[3]! 

According to an article in The Atlantic[9], the internet 
and the web were imagined in the 1930s by Paul Otlet, a 
Belgian bibliographer and entrepreneur. He sketched a plan 
of a global telescope to enable global sharing of books and 
multimedia[10]. Others who followed simply implemented 
this plan, though it required the development of technologies 
and associated how-tos. In [47] the ‘invention’ of the internet 
is not credited to one person, as a number of individuals are 
noted for proposing some of its mechanisms, including the 
concept of transmission of data in the form of small packets, 
the addressing mechanism[93], etc. Some of these had to evolve 
with the connection of more and more computers. Also, there 
were two separate networks, one in the UK and the other in the 
USA. 

The concept of interconnected documents was also 
mentioned in a 1945 article, ‘As We May Think’, in The 
Atlantic by Bush[92]. This was followed by hypertext documents 
of various types[105]. The developments at the Conseil européen 
pour la recherche nucléaire (CERN) in the late 1980s used 
this basic idea and a simple application layer protocol on 
top of the transport layer protocol (TCP) and the internet 
protocol (IP). The Hypertext Transfer Protocol (HTTP) was 
yet another application protocol, like the existing File 
Transfer Protocol (FTP), Simple Mail Transfer Protocol (SMTP) 
and Domain Name System (DNS); it uses the underlying 
Transmission Control Protocol (TCP) to enable a hypertext 
connection between devices, one being a client and the other a 
server. Also, a rudimentary syntax, which has had to be upgraded 
many times, for creating the textual documents and the links 
among these was proposed[53]. Many other people contributed to 
the original idea of hypertext and have since contributed to 
the development of the web; most of this development, under the 
World Wide Web Consortium (W3C), has been for exploiting and 
monetizing. It is like the western myth that Columbus 
discovered America: he did not; he thought he had reached Asia. 
Incidentally, America did not exist!! Hence the web’s 
“invention” cannot and must not be attributed to a single 
person. Rather, what was done was an implementation of an 
application using existing ideas and the flexibility built into 
TCP/IP and OSI[110]². The current web is as different from the 
original as today’s airplanes are from the one imagined by 
Icarus and Daedalus. 
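
The layering described above, with HTTP as a plain-text application protocol carried over a TCP connection between a client and a server, can be illustrated with a minimal sketch. It runs entirely on the local machine using only the Python standard library; the handler name, the page content and the helper functions are made up for this illustration and have nothing to do with the original CERN software.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answers every GET with one small hypertext document."""

    def do_GET(self):
        body = b'<a href="/next">a hypertext link</a>'
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence the default request logging


def fetch_over_tcp(host, port, path="/"):
    """Client side: open a TCP connection and send an HTTP request as text."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("iso-8859-1")


def demo():
    """Start a throwaway local server, fetch one page, and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_over_tcp("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The request and the response are ordinary lines of text; everything concerning delivery is left to TCP/IP, which is why such an application could be added on top of the existing protocols.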

As noted in [23], even before the introduction of the web, 
the internet had made it possible for people to communicate 
via electronic mail (email)[52] and on-line chat (talk), 
allowing sharing of files using anonymous file transfer 
protocol (FTP), news (Usenet News), remote access of computers 
(telnet), Gopher (a tool for accessing internet resources), 
Archie (a search engine for openly accessible internet files) 
and Veronica (search for gopher sites). These early systems 
afforded the opportunity of interconnecting people (who wanted 
to be connected), sharing resources without requiring anything 
in return and providing security and privacy; there was not 
yet any question of monetizing; the whole concept was to 
share and there was no attempt to exploit! However, these 
systems were not adopted widely: the problem with these 
internet tools was the need to have computing savvy; the 
other problem was the lack of an infrastructure to transfer 
the know-how. Incidentally, this was also the requirement 
for the early web, with its user-unfriendly, text-based 
web browser and its lack of training facilities and 
easy-to-learn tools to build and maintain hypertext documents. 
Some early attempts to create software for hypertext[108] were 
buried by the emergence of the early form of the tech giants, 
who were more interested in having their system be the internet 
and in keeping users from learning the basics. 


² In April of 1993, CERN put the web software in the public domain; 
in June of 2021 a non-fungible token (NFT) of the world wide web 
source code sold for $5.4m - another monetizing ploy! 
4 INTERNET COLONIZATION 


The author chaired or co-chaired a number of workshops 
during the early meetings of WWW[14, 15]. A system called 
WebJournal was under construction at a client’s site in mid 
1994[26]: it was to provide a record of web sites discovered 
during a web journey and thus provide a record for later 
reference[27]. This information was recorded locally and there 
was no need for searches. However, the option of using a web 
robot to scour the internet, create a comprehensive list of 
web pages and index them was introduced later. The 
concept of robots, as well as many other features used to track 
users, was not part of the initial design from CERN. These 
were introduced by W3C, dominated by USAian business, to 
serve the needs for tracking, among other monetizing tools. 

Altavista, one of the early search engines, was introduced 
by Digital Equipment Corp. (DEC) in December 1995[106]; it 
had a simple design but, due to many management blunders, 
lost the search war and was shut down in 2013. An upstart 
searching application, Google, claimed to be a better search 
engine because its search result rankings were based on the 
number of ‘respected’ pointers pointing to the page. Even 
though its results in the beginning were middling, as shown in 
the tests reported in [18, 22], Google soon took over 
the lead and now has the playing field to itself. Its sheer 
global coverage and complete control of digital publicity 
marketing, including the publicity trading exchange, the main 
buyer and the main seller[41], have prevented local search 
engines from emerging and challenging this dominance[23]. 
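
The idea of ranking a page by the number of ‘respected’ pointers to it can be illustrated with a short, simplified sketch of link-based ranking by repeated redistribution of scores. It is written in the spirit of the published PageRank idea only; the function name, the damping value of 0.85 and the tiny link graph are assumptions made for this illustration and do not describe any production search engine.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages so that links from high-scoring pages count for more.

    `links` maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(link_rank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this toy graph, page "c" receives the most pointers and ends up with the highest score, and page "a" comes second because its single pointer comes from the well-regarded "c".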

Over the last few years, another USAian search engine, called 
DuckDuckGo, which promises not to track users, has had 
some success. CLIQZ was a recent example of a European 
attempt to create a more open search engine integrated 
with a web browser[34]. On their web site they point out 
the colonization issue: “Europe has failed to build its own 
digital infrastructure. US companies such as Google have 
thus been able to secure supremacy. They ruthlessly exploit 
our data. They skim off all profits. They impose their rules 
on us. In order not to become completely dependent and 
end up as a digital colony, we Europeans must now build 
our own independent infrastructure as the foundation for a 
sovereign future. A future in which our values apply. A future 
in which Europe receives a fair share of the added value. A 
future in which we all have sovereign control over our data 
and our digital lives. And this is exactly why we at Cliqz are 
developing digital key technologies made in Germany.” [34] 

Alas, on April 29, 2020[33], the Cliqz story was over. According 
to the team, they were able to build an index from scratch and 
introduced many innovations but, with the Covid-19 pandemic 
and the continued dominance of the other systems, they 
realized that there was no future for Cliqz. The CLIQZ team 
built a browser that protected users’ privacy using a powerful 
anti-tracking and content blocking technology, and of course 
a search engine. They did it with a modest budget and 
attracted top talent. 

Even though CLIQZ had daily users in the hundreds of 
thousands, they were not able to meet the cost due to the 
inertia of users continuing to favour the colonizing giants. 
Worst of all, the political stakeholders, both in Germany, 
where CLIQZ was based, and in the EU, have not woken up to 
the fact that they are supporting a colonized Internet and that 
the colonizing power is USAian big tech with the tacit support 
of successive USAian governments. After a heroic attempt, the 
search engine has hit the dust (cloud?); the CLIQZ browser is 
still around - at least for a while. Currently, in most countries 
of the world, even though they may have a local search engine, 
the lion’s share of searches uses Google! 

It is evident that most democratic countries need their own 
locally regulated digital internet infrastructure. The myth 
that the Internet is free and open has been exploited by 
many. The world deserves a fairer, democratic, non-colonized 
Internet, web and online social networks (OSN). 

According to [68], “Politicians and public officials were 
complicit in Facebook’s legitimization as a political forum. 
Special credit has to go to Barack Obama, as a presidential 
candidate in 2008, for demonstrating that a power base 
located on Facebook could take you all the way to the White 
House.” This platform has been used by other politicians, 
dictators, political parties and others to swing elections, 
mold people’s perception of reality and present an ‘alternate’ 
reality which is usually a mirage at best and in reality 
an untruth. “But the company has repeatedly failed to take 
timely action when presented with evidence of rampant 
manipulation and abuse of its tools by political leaders around 
the world” [50, 51, 55]. 

The large internet companies, using the advantage of the 
early start, the protection of the USAian government and the 
guise of net freedom, have been enjoying a non-level playing 
field in web technology. The presence of a colossal corporation 
in search and in on-line social networks is preventing any other 
attempts from succeeding. Every time the issue of regulating 
big-tech comes up in the USA, the big-techs and their allies 
start fear mongering about giving China the advantage. This 
distracts from the important question of addressing the issue 
of exploiting the users’ data and violating their privacy, and 
of taking away the range of choices by offering a limited 
number of options beneficial to the big-techs[90]. The OSNs 
have killed the early attempts to create software for 
establishing one’s own web presence, not only for individuals 
but also for most small organizations. One of the first things 
these OSNs did was to recruit USAian politicians and show them 
the ease with which they could, with very little computing 
savvy, set up their interactive web presence and share it with 
their electors. Others followed and they all joined in like 
lemmings. In a way, the OSNs have become a road-block for 
personal and community based sharing systems with local 
control. 

Another problem has been the lack of imagination and the 
inaction of most western governments, postal services and 
telecom utilities in providing the tools. The governments have a 
false faith in a free market which has never been free: the big 
ones dominate any start-ups and competition. These giants 
have become too large to regulate, and they have a large 
network of lobbyists and lawyers with direct access to the 
legislative and executive bodies of the USAian government. 

There is not a single international search engine of any 
size that is not headquartered in the USA. The attempt by 
Cliqz failed in spite of their success in creating their own 
system and integrating it in a privacy preserving browser. 
Another example of this imperialist push, as noted in [23], 
was the attempt by Facebook to have a completely controlled 
free service provided by an Indian telecom carrier, which would 
have Facebook at the center with a few other services chosen 
by Facebook. This was an attempt by Facebook to make 
itself the Internet for potentially over a billion users on the 
sub-continent. 

Most governments have not come around to adequately 
taxing these foreign companies. Even the recent attempt by 
the G7 countries to impose a paltry 15% income tax[7, 70] 
has loopholes, and most of these big-techs hardly pay a tax 
of even 4%[70]. One wonders why the governments do not 
tax these companies on the revenues earned in the country, 
regardless of the location of the big-tech. It is so easy with 
today’s database and data analytic tools to determine the 
revenue earned in each country, tax the companies on this 
revenue and not allow them to play the shell game. However, 
incompetent politicians would not listen to even their own 
civil servants, much less ask them to implement the system 
and put in laws. Of course, these laws would have to override 
any ‘free’ trade agreement, or the country would even have to 
walk out of it. 

A limited number of colonizing on-line social networks have 
attracted people from all parts of the world and given the 
despots around the world the ability to be heard everywhere 
without any checks and balances. The other issue is that some 
governments are trying to control such networks, which have to 
comply so as not to lose their income from the country. A case 
in point is the recent attempt in India to remove content 
critical of the government. China is forcing Apple to host all 
data of its citizens in China. In this case, this data would 
likely be accessible to the communist government: Apple has 
no choice but to comply, since it would otherwise risk a large 
portion of its global business in China and most of its 
manufacturing facilities. In this way, Apple has become an 
instrument to present a government-controlled version of the 
internet[58]. 


5 CLOSED SYSTEMS 


The marketing of computing systems in the early days 
included the bundling of basic software support. This included 
the operating system, the compilers and libraries as well 
as training manuals. An organization would either buy the 
bundle or lease it and develop the specific software 
applications for its own use in-house. Computer Science evolved 
to train people who would develop this application software. 
The competitors to IBM, the most successful manufacturer 
of bundled systems, were software-only houses. These 
competitors, using the courts and the USA’s anti-trust laws, 
were successful in un-bundling software from the hardware. The 
anti-trust case was based on the rationale that people who 
wanted software should not have to buy the hardware as 
well. This anti-trust case was finally dropped, but it gave rise 
to a number of software houses. This and the idea of one 
size fits all led to the establishment of software houses which 
produced sets of generic software that could be used for many 
businesses and replace the in-house systems. The concept of 
a bundled system that IBM used in the mid-sixties for its 
System 360 was subjected to litigation and prompted the 
unbundling of software and hardware. This un-bundling and 
the introduction of the PC by IBM, which was an un-bundled 
hardware system, gave rise to the birth of one of the five 
big-techs, sometimes called the fearsome five[65]. 

What has happened now is that closed devices are sold with
the software, including tracking sub-systems, built into
them [45]. All the software applications (apps) created by
independent software houses are installed via the operating
system of the device, and the device maker takes a percentage
of any revenue earned by the application. There is no move
anywhere to unbundle the software, including the applications,
from the hardware. This is clearly against the spirit of the
System 360 settlement. However, it has been used successfully
to date by the big-techs and is being imitated by others. One
would expect that, since such closed-device makers control the
‘application store’ and take a hefty percentage of the revenue
earned by the application maker, they would exercise some
diligence in ensuring the quality of the software they make
available. Recent articles in the press have shown that some
of this software, as usual, has bugs and security loopholes
which can be exploited by spyware makers. One such exploit is
attributed to a spyware firm in Israel which has targeted
activists [77]. These systems, especially Internet of Things
(IoT) devices, cell phones being one of them, lock in the
users’ data without providing a method for users to take care
of their own data. An architectural solution and proof of
concept to address this problem was presented in [1, 20].

The latest trend is to abandon in-house systems, including
email systems, by moving them to a cloud run by one of these
tech giants. The promise of tremendous cost savings is often
an illusion. From the experience described in a report by the
Canadian Senate and on the Wikipedia page [82, 91], one can
see the fiasco caused by the “Phoenix” system, mentioned
earlier, which was bought by the Harper government to save
money. After five years of continued complaints about
underpayments, overpayments and non-payments caused by a
software system that was supposed to save $70 million a year,
fixing Phoenix’s problems will cost Canadian taxpayers up to
$2.2 billion by 2023, according to a Senate report.
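The scale of the Phoenix figures quoted above can be checked with
a line of arithmetic. The following Python sketch is only an
illustration: the two amounts are the ones quoted in the text,
while the function and constant names are hypothetical and
introduced here for the example.

```python
# Figures quoted in the text above, in Canadian dollars.
PROMISED_SAVINGS_PER_YEAR = 70_000_000
ESTIMATED_COST_BY_2023 = 2_200_000_000


def years_of_savings_consumed(total_cost, savings_per_year):
    """Number of years of promised savings needed to offset total_cost."""
    return total_cost / savings_per_year


years = years_of_savings_consumed(ESTIMATED_COST_BY_2023,
                                  PROMISED_SAVINGS_PER_YEAR)
print(f"{years:.1f} years of promised savings")
```

Under these figures, the repair bill alone amounts to roughly
three decades of the savings that were promised.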

The truth of the matter is that these big-techs are too
big and have colonized the internet. The big-tech business
model is to get as many people as possible to spend as many
hours as possible on their sites or devices so that they can
sell those people’s attention to advertisers. The claim that
the internet is free is a myth. Each society, each city, each
community must have its own content under its own
jurisdiction and under the control of accountable elected
officers.

Some of the for-profit big-techs that run social media
claim that they support social justice; however, their
products and their marketing models do not reflect this
lip-service [49]. They claim that they spend billions of
dollars on AI work to address these problems, but their
business model relies on the research which shows that
divisive content attracts and keeps the audience. Also, how
much of these billions goes to support tools to weed out
objectionable material, including hate speech, pedophilia and
false claims [83]? Some administrations have been known for
promoting the “divide and conquer” tactic practised by
invaders over centuries. Content that violates accepted
decorum but comes from certain political figures is allowed
under an invented tag such as “newsworthy”; such content is
judged to be ‘newsworthy’ not by independent observers but by
the big-techs themselves. The label does not consider
inaccuracies or falsehoods, nor whether the content is
hateful [28, 66, 89].


6 FALLOUT FROM THE COLONIZATION


The big-techs have convinced billions of users that the service
they are getting is free. Even if we ignore the intangible cost
in terms of the loss of privacy and the hawking of personal
data, images, opinions etc. to anyone who is willing to pay for
them, the service offered by these big-techs is really NOT
free! In order to access their service, or any other, there is
the cost of the device and the bandwidth needed. The device
will cost from several hundred dollars to a couple of thousand.
The communication costs range from 50 to 100 dollars per
month. So the consumer is paying. In addition to these costly
“free” services, businesses such as utilities, credit card
companies and others want their customers to be billed
electronically. This saves them the mailing costs, but they do
not discount the user’s bill by a corresponding amount. There
are no regulations about passing such savings back to the
customer and one wonders if there ever will be! By comparison,
one can set up a personal mail service for a few dollars a
month. As noted in [44], “when a product is free, the user is
the product.”
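The cost argument above can be made concrete with a little
arithmetic. The following Python sketch is illustrative only:
the device prices (300 and 2000 dollars, standing in for
“several hundred” and “a couple of thousand”), the three-year
device lifetime and the function name are assumptions made for
the example; only the monthly communication cost of 50 to 100
dollars comes from the text.

```python
def yearly_access_cost(device_price, device_lifetime_years, monthly_bandwidth):
    """Yearly cost of reaching a 'free' service: device share plus bandwidth."""
    device_share = device_price / device_lifetime_years  # assumed even spread
    return device_share + 12 * monthly_bandwidth


# Assumed low and high scenarios (device price and lifetime are hypothetical).
low = yearly_access_cost(300, 3, 50)
high = yearly_access_cost(2000, 3, 100)
print(f"{low:.0f} to {high:.0f} dollars per year")
```

Even under the low scenario, the user pays several hundred
dollars a year before using any so-called free service.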

These companies are invading new territories where there
are no regulations. One such plan is Sidewalk [3]³ from
Amazon, one of these big-tech businesses. Sidewalk is a bridge
scheme hatched to ‘steal’ a user’s bandwidth with no
remuneration and no guarantee that it could not be used to
hack into a person’s system! Amazon Sidewalk works by creating
a piggyback low-bandwidth network out of the smart home
devices a person has purchased; it also uses the person’s
telecommunication bandwidth without any permission beyond an
opt-out. The system would likely be extended to devices and
applications from third parties that Amazon would later
license. Since there are no regulations, this company is
basically stealing bandwidth, however small, from customers
who are naive enough to pay for an untried product that they
may not really need, just because, as usual, such products are
hyped up and marketed to target a person’s insecurity and
concern for safety and security.


³ Running out of monikers: this word reminds one of the fiasco Google
made with its own Sidewalk, the Toronto waterfront project, from which
it withdrew when it did not get its slice of the pie [2, 32, 40].
Since there are no regulations and laws to control the
internet and its components, such encroachment on privacy
and security will continue. The internet, including mobile
phones, has become the gold rush of our times and anyone
with support from venture capital can stake a claim. Systems
are designed as closed systems, yet they allow any third party
and application to access the users’ data. The rush is to grab
all possible types of user data, and this includes financial
institutions which charge a fee for a customer’s account.
Recently RBC, the largest bank in Canada, made it a condition
of on-line banking that its customers give permission for
anonymized data of their on-line transactions to be used in
any way the bank sees fit, and passed to any third party it
chooses, without concern for its customers’ privacy⁴. As
usual, the on-line form for giving this permission was
innocuous, but when one follows the link there is a 37-page
document with all the legalese. One wonders where the
regulating agency is and what it is regulating. If lawsuits
and litigation, mostly based on monopoly legislation, are the
only way [38], the entire system is going to be bogged down
for years to come and the issue may not be satisfactorily
resolved. A lawsuit against Facebook, launched in British
Columbia, Canada, has been going back and forth from one court
to another since 2013 [30, 59].


[Figure: two screenshots. The first is a CBC editor’s note, “Why CBC
is turning off Facebook comments on news posts for a month”. The
second is an on-line agreement prompt, “What does the agreement
cover? Read & understand 37 pages of legalese in minutes!”]


⁴ There are ways to identify a person from anonymized data [63].

One of the reasons that these big-techs have become so
large lies squarely with the media; to these one can add
governmental agencies, public institutions (including
universities) and private businesses. They have all rushed in
to get ‘free’ exposure. They are providing legitimacy to these
robber barons of our age by strategically displaying the logos
of these mammoths on their own web sites; these logos take
visitors to the big-techs’ sites, where they may be required
to sign in for interaction. This gives these big-techs new
victims whose personal data they can mine. For example, most
universities have a presence on the big-tech sites because the
others are there, even though they have their own web sites
over which they have complete control. We are in a vicious
circle: agencies, organizations and universities are using
these sites because more people are there; however, more
people are there because all of the above are promoting and
using these sites! It makes one wonder what the I in the new
high office called CIO stands for!

Why can’t universities, as centers for education, manage their
own interactions with their alumni, students and prospective
students without going through these third parties?⁵
Furthermore, many organizations allow or encourage users to
log in to their systems using their credentials for Facebook
or Google! Effectively they offer their customers as
sacrificial lambs to these tech giants, who push the boundary
of civic decency for a greater share of the market. The media
is chock full of quoted posts and tweets; these in turn lure
more unsuspecting souls to be trapped in the web of these
monopolies and provide them with yet more personal data.


6.1 Survival of Newspapers and Journalism


Another fallout of the monopolies established in email, web
search, social networks, cell phones, computing devices and
shopping is the effect they have on journalism and local
newspapers. Recently, an open letter to the Prime Minister of
Canada was published in many Canadian newspapers [56] to
communicate the following:


“For months, you and the Minister of Canadian Heritage, 
Steven Guilbeault, have promised action to rein in the preda- 
tory monopoly practices of Google and Facebook against 
Canadian news media. But so far, all we’ve gotten is talk. 
And with every passing week, that talk grows hollower and 
hollower. 

As you know, the two web giants are using their control 
of the Internet and their highly sophisticated algorithms 
to divert 80% of all online advertising revenue in Canada. 
And they are distributing the work of professional journalists 
across the country without compensation. 


⁵ It is argued that the reason people flock to these OSN sites is
their ‘better’ interface: this is a fallacy. The OSN interface may
not necessarily be better! Most people are comfortable with what
they are used to and are too reluctant (busy/lazy) to learn anything
new. Also, these interfaces have evolved.

This isn’t just a Canadian problem. Google and Facebook
are using their monopoly powers in the same way throughout 
the world — choking off journalism from the financial resources 
it needs to survive. 


In fact, the health of our democracy depends on a vibrant 
and healthy media. To put it bluntly, that means that you, 
Prime Minister, need to keep your word: to introduce legis- 
lation to break the Google/Facebook stranglehold on news 
before the summer recess. It’s about political will — and 
promised action. Your government’s promise. 

The fate of news media in Canada depends on it. In no 
small way, so too does the fate of our democracy.” ⁶


If one looks at the fight put up not only by Google and
Facebook, which amounts almost to blackmail, but also by
the USAian government, as recorded in the submissions [5], one
understands the part this government plays in this type of
colonization. It is hard to understand how these submissions
fail to see that Australia was addressing the market failures
in digital news content and digital advertising by combining
elements of the French Press Publishers’ Rights, the collective
bargaining of publishers’ licensing against the market power
of the platforms, as well as a novel process for the
negotiation and, if necessary, arbitration of prices [12, 75].

The USA, which has put in place a ‘platform’ designation for
many of these big-techs and thus exempted them from any
requirement of diligence over what appears on their sites,
seems to be behaving like the colonial governments did in the
early days of the East India Company: they made it easier for
these companies to exploit people and enslave them,
figuratively or literally, using their military power and the
divide and conquer rule. The Australian draft law requiring
Google and Facebook to deal with a consortium of news media
merely provides a balance. Yet the USA submitted an opposing
brief, arguing that allowing any number of media players to
bargain collectively does not respect the principles of
competition⁷ [35]. The tech giants, using their size and their
political connections, are able to dictate their own unfair
terms to the news media for the use of their content. Even the
treatment of their own employees can seem callous [57].

One also notes the submission [6], which includes self-praise
and points to the initial design of the link and free access
without any mention of the monopolization of the internet
and the exploitation of personal data for exorbitant private
profit. One wonders what part was played by the W3C [112] in
the introduction of cookies, tracking and other tools not in
any design since Otlet! There is no concern about copyright,
fair practice, ownership, privacy etc., and no credit to all
the others who had put forward the ideas of hypertext. The
reading of such submissions makes one think of the famous
lines, most likely attributable to some prolific author,
uttered so often by many high school teachers to one of their
unruly students, to the delight of the class: “It is better
that you keep your mouth shut ....”


⁶ Alas, nothing was done and the summer recess has been called. In the
meantime, these companies completely control the digital advertising
market [41].

⁷ Are the big-techs respecting the principle of competition?


7 A POSSIBLE SOLUTION 


In addition to the introduction of regulations and legislation
to protect privacy and stop the trade of user data by
tech companies and businesses, the only way to liberate
ourselves from this internet colonization is to stop using
these colonizing, so-called ‘free’ services [46].

The teaser figure above shows the statue of the salt march,
which was the start of the struggle against the British
colonization of India [88]. Similarly, the fight against the
colonization of the internet, and to stop the violation of
human rights by the hijacking of people’s privacy, must start.
This fight should involve not only ordinary users but also
organizations. The latter should stop using the logos of these
big-techs on their web sites and stop accepting the big-techs’
log-in credentials. Since most of these organizations already
have a web site, they should invest in infrastructure to add
interaction with their users and customers. This would not
only remove duplication but also stop sacrificing their users
and clientele to feed the big-techs. Since the path to a web
site is the browser, browsers must be more privacy conscious
and disallow tracking, fingerprinting etc. Browsers currently
have a facility to store log-in credentials; this must be made
more robust and open. A user should be able to access this
data, which must be stored locally under a secure password;
thus the user needs to remember but one password.

The big-techs, especially the OSNs, have marketed the idea
that traffic to an organization’s pages, hosted without charge
on the OSN’s ‘free’ site, would increase because of the large
number of OSN users. However, one link is just as good
as another and any user with any brains should be able
to find an organization using any of the search engines,
including those that do not track. Furthermore, the presence
of the independent organizations’ pages is instrumental in
increasing the number of users and the traffic on the OSNs.

For the ordinary user, there is another way to get a web
presence and to share without using OSNs. These users need
an internet service provider (ISP), and most ISPs also provide
a web presence and an email service: users should look at
these services as an alternative. As more people use these
alternatives, the services would gradually improve and
additional services may be added. One can also set up one’s
own server using an open source system such as Linux. Linux
has many distributions; some of them are aimed at the not too
technically savvy, and help is provided via numerous
discussion forums. Educating ordinary users in the mysteries
of the web, databases etc. should be addressed. One attempt by
the author and a colleague is to publish a text and the code
for a set of examples; this could allow even a novice to set
up a database-driven web presence. This volume is ‘open
access’ under a copy-forward scheme [24, 25].
[Figure caption: Do they need these OSNs? They have their own Web sites!]


In [20], a scheme was introduced to protect the users’ privacy
by keeping all digital data of IoT devices, including cell
phones, under the control of the user. Regulation and
legislation are also needed to crack open the closed systems
used by many of these big-techs. Democratic countries must
not wait for the USAian government to start the process,
since the USA is protecting its big-techs, as evidenced by the
submissions made by various USAian agencies during the
Australian senate hearings on news media [5]. Any sane judge
would uphold laws and regulations allowing a balance in
negotiation between a giant and a consortium of small local
publishers. Also, according to the legal opinion cited in [75],
under the most recent trade agreement between Canada, Mexico
and the USA there is the “ability to take legitimate policy
actions in the public interest, including with respect to
health, the environment, indigenous rights, and national
security; and for Canada to take measures to promote and
protect its cultural industries. Action taken under the
authority of the exceptions is permitted even if it otherwise
would have violated obligations in the Agreement.”

The newspaper publishers should also provide access to
their digital content either free or for a very small fee; many
newspapers already offer digital access, but the cost is too
high! This access, in addition to providing content, could be
extended to give subscribers the means to interact with
others, set up forums to discuss local concerns, provide
pointers to local resources and provide the civic network for
the community. If more readers went directly to the media’s
web sites, they would be tempted to make a voluntary
contribution to support these services. Recently La Presse, a
French-language newspaper in Montreal, went all digital and
became a non-profit organization. In this way it can accept
donations from readers. However, since most news media are
privately owned by for-profit corporations, they would not be
willing to give up ownership. Nevertheless, some means of
opening up and providing more community services would be one
way for them to survive. The cost of creating an application
for interaction and feedback is not astronomical! Over the
years, the author has assigned a group project to students in
an undergraduate course in databases. The project usually
requires a web-based application that mimics one of the social
media systems. Most of the implementations are better than
the initial system developed by some drop-outs and pushed
with the help of some incompetent USAian politicians,
businesses around the world and groupie users.

This model could be extended to other media such as
videos, and then the community could have a streaming
service controlled not by giants but by a local consortium. As
one notices, the giant streaming services are controlling the
production of content that is mediocre at best. The best
content, the classics, is no longer to be found. The irony is
that the production by these giant streaming services is
financed by our taxes, awarded by naive politicians looking
for a trickle-down effect. Alas, there is none, since the rich
management of all these giants has built dams to prevent
this loss!


Acknowledgement 


The author would like to acknowledge the valuable discussions
with, and the contributions of, Drew Desai (Univ. of Ottawa)
and Sheila Desai (BytePress), as well as the many researchers
and journalists cited, and perhaps missed; these have been
valuable in preparing this article.


REFERENCES 


[1] Ayberk Aksoy; Bipin C. Desai: Heimdallr_1: A system design for the next generation of IoTs. ICNSER2019, March 2019, Pages 92-100.
    https://doi.org/10.1145/3333581.3333590
[2] Andrew Clement: Sidewalk Labs’ Toronto waterfront tech hub must respect privacy, democracy. The Toronto Star, Jan. 12, 2018.
    https://www.thestar.com/opinion/contributors/2018/01/12/sidewalk-labs-toronto-waterfront-tech-hub-must-respect-privacy-democracy.html
[3] Alex Hern: Amazon US customers have one week to opt out of mass wireless sharing. The Guardian, June 1, 2021.
    https://www.theguardian.com/technology/2021/jun/01/amazon-us-customers-given-one-week-to-opt-out-of-mass-wireless-sharing
[4] Annalee Newitz: A Better Internet Is Waiting for Us. NY Times, Nov. 30, 2019.
    https://www.nytimes.com/interactive/2019/11/30/opinion/social-media-future.html
[5] Submissions received by the Committee, Item 13, 13.1, 17.
    https://www.aph.gov.au/Parliamentary_Business/Committees/Senate/Economics/TLABNewsMedia/Submissions
[6] Submissions received by the Committee, Item 46.
    https://www.aph.gov.au/Parliamentary_Business/Committees/Senate/Economics/TLABNewsMedia/Submissions
[7] Alan Rappeport; Liz Alderman: Yellen Aims to Win Support for Global Tax Deal. NY Times, June 3, 2021.
    https://www.nytimes.com/2021/06/02/us/politics/yellen-global-tax.html
[8] Adam Satariano: Facebook Faces Two Antitrust Inquiries in Europe. NY Times, June 4, 2021.
    https://www.nytimes.com/2021/06/04/business/facebook-eu-uk-antitrust.html


IDEAS 2021: the 25th anniversary 


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada

[9] Alex Wright: The Secret History of Hypertext. The Atlantic, May 22, 2014.
    https://www.theatlantic.com/technology/archive/2014/05/in-search-of-the-proto-memex/371385/
[10] Alex Wright: Cataloging the World: Paul Otlet and the Birth of the Information Age. Oxford University Press, 2014, ISBN: 9780199931415.
[11] Ben Brody; David McLaughlin; Naomi Nix: U.S. Google Antitrust Case Set to Expand With GOP States Joining. Bloomberg, September 11, 2020.
    https://www.bloomberg.com/news/articles/2020-09-12/u-s-google-antitrust-case-set-to-expand-with-gop-states-joining
[12] What Is Baseball Arbitration? June 9, 2011.
    http://www.arbitration.com/articles/what-is-baseball-arbitration.aspx
[13] Bipin C. Desai: The Web of Betrayals. IDEAS 2018, pp 129-140.
    https://doi.org/10.1145/3216122.3216140
[14] Bipin C. Desai: Navigation Issues Workshop. First World Wide Web Conference, Geneva, May 28, 1994.
    http://users.encs.concordia.ca/~bcdesai/web-publ/navigation-issues.html
[15] Bipin C. Desai; Brian Pinkerton: Web-wide Indexing/Semantic Header or Cover Page. Third International World Wide Web Conference, April 10, 1995.
    http://users.encs.concordia.ca/~bcdesai/web-publ/www3-wrkA/www3-wrkA-proc.pdf
[16] Bipin C. Desai: IoT: Imminent ownership Threat. Proc. IDEAS 2017, July 2017.
    https://doi.org/10.1145/3105831.3105843
[17] Bipin C. Desai: Privacy in the age of information (and algorithms). Proc. IDEAS ’19, June 2019.
    https://doi.org/10.1145/3331076.3331089
[18] Bipin C. Desai: Test: Internet Indexing Systems vs List of Known URLs. Summer 1995.
    http://users.encs.concordia.ca/~bcdesai/web-publ/www3-wrkA/www3-wrkA-proc.pdf
[19] Bipin C. Desai: Privacy in the Age Of Information (and algorithms). IDEAS 2019, June 2019, Athens, Greece.
    https://doi.org/10.475/3331076.3331089
[20] Bipin C. Desai: IoT: Imminent ownership Threat. IDEAS 2017, July 2017, Bristol, UK.
    https://doi.org/10.475/3105831.3105843
[21] Bipin C. Desai: The State of Data. IDEAS ’14, July 2014, Porto, Portugal.
    http://dx.doi.org/10.1145/2628194.2628229
[22] Bipin C. Desai: Search and Discovery on the Web. Fall 2001.
    https://spectrum.library.concordia.ca/983874/1/rerevisit-2001.pdf
[23] Bipin C. Desai: The Web of Betrayals. In Proc. of IDEAS 2018, ACM, New York, NY, USA, 12 pages.
    https://doi.org/10.1145/3216122.3216140
[24] Bipin C. Desai; Arlin L Kipling: Database Web Programming.
    https://spectrum.library.concordia.ca/988529/
[25] Bipin C. Desai; Arlin L Kipling: Database Web Programming.
    https://spectrum.library.concordia.ca/987312/
[26] Bipin C. Desai: WebJournal: Visualization of WebJourney. August, 1994.
    https://spectrum.library.concordia.ca/988478/
[27] Bipin C. Desai; Stan Swiercz: WebJournal: Visualization of a Web Journey. ADL’95 Forum, May 15-17, 1995. Lecture notes in computer science (1082), pp. 63-80.
    https://www.springer.com/gp/book/9783540614104
[28] Kellen Browning: Twitch Suspends Trump’s Channel for ‘Hateful Conduct’. NY Times, June 29, 2020.
    https://www.nytimes.com/2020/06/29/technology/twitch-trump.html
[29] Brian X. Chen: Buyers of Amazon Devices Are Guinea Pigs. That’s a Problem. NY Times, June 16, 2021.
    https://www.nytimes.com/2021/06/16/technology/personaltech/buyers-of-amazon-devices-are-guinea-pigs-thats-a-problem.html
[30] Facebook class action lawsuit launched by Vancouver woman. CBC News, May 30, 2014.
    http://www.cbc.ca/news/canada/british-columbia/facebook-class-action-lawsuit-launched-by-vancouver-woman-1.2660461
[31] Chris Hauk: Browser Fingerprinting: What Is It and What Should You Do About It? Pixel Privacy, May 19, 2021.
    https://pixelprivacy.com/resources/browser-fingerprinting/
[32] Cecco, Leyland: Google affiliate Sidewalk Labs abruptly abandons Toronto smart city project. The Guardian, 7 May 2020.
    https://www.theguardian.com/technology/2020/may/07/google-sidewalk-labs-toronto-smart-city-abandoned
[33] CLIQZ: Cliqz story is over. April 29, 2020.
    https://cliqz.com/announcement.html


[34] CLIQZ: An independent alternative for the digital era.
    https://cliqz.com/en/home
[35] Calla Wahlquist: US attacks Australia’s ‘extraordinary’ plan to make Google and Facebook pay for news. The Guardian, 18 Jan 2021.
    https://www.theguardian.com/media/2021/jan/19/us-attacks-australias-extraordinary-plan-to-make-google-and-facebook-pay-for-news
[36] Damien Cave: An Australia With No Google? The Bitter Fight Behind a Drastic Threat. NY Times, Jan. 22, 2021.
    https://www.nytimes.com/2021/01/22/business/australia-google-facebook-news-media.html
[37] Deb Haaland: My grandparents were stolen from their families as children. We must learn about this history. The Washington Post, June 11, 2021.
    https://www.washingtonpost.com/opinions/2021/06/11/deb-haaland-indigenous-boarding-schools/
[38] David McCabe: Big Tech’s Next Big Problem Could Come From People Like ‘Mr. Sweepy’. NY Times, Feb. 16, 2021.
    https://www.nytimes.com/2021/02/16/technology/google-facebook-private-antitrust.html
[39] David MacDonald: Canada’s hypocrisy: Recognizing genocide except its own against Indigenous peoples. The Conversation.
    https://theconversation.com/canadas-hypocrisy-recognizing-genocide-except-its-own-against-indigenous-peoples-162128
[40] Diamond, Stephen: Statement From Waterfront Toronto Board Chair. May 7, 2020.
    https://quaysideto.ca/wp-content/uploads/2020/05/Waterfront-Toronto-Statement-May-7-2020.pdf
[41] Dina Srinivasan: Google Is Dominating This Hidden Market With No Rules. NY Times, June 21, 2021.
    https://www.nytimes.com/2021/06/21/opinion/google-monopoly-regulation-antitrust.html
[42] Elia Zureik: Israel’s Colonial Project in Palestine: Brutal Pursuit. Routledge, Taylor & Francis, 2015, ISBN: 9780415836074.
[43] Frank, Thomas: The Pessimistic Style in American Politics. Harper’s, May 2020.
    https://harpers.org/archive/2020/05/how-the-anti-populists-stopped-bernie-sanders/
[44] Greg Bensinger: Google’s Privacy Backpedal Shows Why It’s So Hard Not to Be Evil. NY Times, June 14, 2021.
    https://www.nytimes.com/2021/06/14/opinion/google-privacy-big-tech.html
[45] Geoffrey A. Fowler: iTrapped: All the things Apple won’t let you do with your iPhone. The Washington Post, May 27, 2021.
    https://www.washingtonpost.com/technology/2021/05/27/apple-iphone-monopoly/
[46] Joe Guinan; Martin O’Neill: Only bold state intervention will save us from a future owned by corporate giants. The Guardian, 6 Jul 2020.
    https://www.theguardian.com/commentisfree/2020/jul/06/state-intervention-amazon-recovery-covid-19
[47] History.com Editors: The Invention of the Internet. Oct 28, 2019.
    https://www.history.com/topics/inventions/invention-of-the-internet
[48] Jessy Hempel: What Happened to Facebook’s Grand Plan to Wire the World? Wired, 05.17.2018.
    https://www.wired.com/story/what-happened-to-facebooks-grand-plan-to-wire-the-world/
[49] Jeff Horowitz; Deepa Setharamman: Facebook Executives Shut Down Efforts to Make the Site Less Divisive. Wall Street Journal, May 26, 2020.
    https://www.wsj.com/articles/facebook-knows-it-encourages-division-top-executives-nixed-solutions-11590507499
[50] Hogan, Libby; Safi, Michael: Revealed: Facebook hate speech exploded in Myanmar during Rohingya crisis. The Guardian, April 3, 2018.
    https://www.theguardian.com/world/2018/apr/03/revealed-facebook-hate-speech-exploded-in-myanmar-during-rohingya-crisis
[51] Harari, Yuval Noah: Why Technology Favors Tyranny. The Atlantic, October 2018.
    https://www.theatlantic.com/magazine/archive/2018/10/yuval-noah-harari-technology-tyranny/568330/568330/
[52] Ian Peter: The history of email. Accessed 2020.
    http://www.nethistory.info/History of the Internet/email.html
[53] Ian Peter: History of the World Wide Web. Accessed 2020.
    http://www.nethistory.info/History of the Internet/web.html
[54] John Barber: Canada’s indigenous schools policy was ‘cultural genocide’, says report. The Guardian, June 2, 2015.
    https://www.theguardian.com/world/2015/jun/02/canada-indigenous-schools-cultural-genocide-report

Bipin C. Desai


[55] Julia Carrie Wong: Revealed: the Facebook loophole that lets world leaders deceive and harass their citizens. The Guardian, Apr. 12, 2021.
    https://www.theguardian.com/technology/2021/apr/12/facebook-loophole-state-backed-manipulation
[56] Jamie Irving: Open letter to Prime Minister Justin Trudeau. June 9, 2021.
    https://www.levellingthedigitalplayingfield.ca/open_letter.html
[57] Jodi Kantor; Karen Weise; Grace Ashford: The Amazon That Customers Don’t See. NY Times, June 15, 2021.
    https://www.nytimes.com/interactive/2021/06/15/us/amazon-workers.html
[58] Jack Nicas; Raymond Zhong; Daisuke Wakabayashi: Censorship, Surveillance and Profits: A Hard Bargain for Apple in China. NY Times, May 17, 2021.
    https://www.nytimes.com/2021/05/17/technology/apple-china-censorship-data.html
[59] Jason Proctor: B.C. Appeal Court clears way for Facebook class action. CBC News, May 11, 2018.
    https://www.cbc.ca/news/canada/british-columbia/facebook-sponsored-stories-appeal-courts-1.4659350
[60] Kenan Malik: Tell me how you’ll use my medical data. Only then might I sign up. The Guardian, 6 Jun 2021.
    https://www.theguardian.com/commentisfree/2021/jun/06/tell-me-how-youll-use-my-medical-data-then-i-might-sign-up
[61] Linden A. Mander: Axis Rule in Occupied Europe: Laws of Occupation, Analysis of Government, Proposals for Redress. The American Historical Review, Volume 51, Issue 1, October 1945, Pages 117-120.
    https://doi.org/10.1086/ahr/51.1.11
[62] The Nobel Prize: The Nobel Peace Prize 1957.
    https://www.nobelprize.org/prizes/peace/1957/summary/
[63] Luc Rocher; Julien M. Hendrickx; Yves-Alexandre de Montjoye: Estimating the success of re-identifications in incomplete datasets using generative models. Nat Commun 10, 3069 (2019).
    https://doi.org/10.1038/s41467-019-10933-3
[64] McKibben, Bill: What Facebook and the Oil Industry Have in Common. The New Yorker, July 1, 2020.
    https://www.newyorker.com/news/annals-of-a-warming-planet/what-facebook-and-the-oil-industry-have-in-common
[65] Farhad Manjoo: Tech’s Frightful Five: They’ve Got Us. NY Times, May 10, 2017.
    https://www.nytimes.com/2017/05/10/technology/techs-frightful-five-theyve-got-us.html
[66] Paul Mozur: A Genocide Incited on Facebook, With Posts From Myanmar’s Military. NY Times, Oct 2018.
    https://www.nytimes.com/2018/10/15/technology/myanmar-facebook-genocide.html
[67] Mali Ilse Paquin: Canada confronts its dark history of abuse in residential schools. The Guardian, June 6.
2015 https://www.theguardian.com/world/2015/jun/06/canada- 
dark-of-history- residential-schools 

Micah L. Sifry: Escape From Facebookistan The New Republic, 
May 21, 2018 https://newrepublic.com/article/148281/escape- 
facebookistan-public-sphere 

Michael Sfard: Why Israelt progressives have_ started 
to talk about ‘apartheid’ The Guardian, 3 Jun 2021 
https: //www.theguardian.com/commentisfree/2021/jun/03/ is- 
raeli apartheid-israel-jewish-supremacy-occupied-territories 
Nadia Calvino; Daniele Franco; Bruno Le Maire; Olaf 
Scholz: A global agreement on corporate tax is in sight 
— let’s make sure it happens The Guardian, June 4, 2021 
https: //www.theguardian.com/commentisfree/2021/jun/04/g7- 
corporate-tax-is-in-sight-finance-ministers 

National Centre for Truth and Reconciliation, 2015 What We 
Have Learned, Principles of Truth and Reconciliation ISBN 
978-0-660-02073-0 https://nctr.ca/records/reports/ 

National Centre for Truth and Reconciliation Canada’s 
Residential Schools: The History: Part 1 Origins to 1939 
https: //nctr.ca/records/reports/ 

National Centre for Truth and Reconciliation 
Residential Schools: The History, Part 2 
https: //nctr.ca/records/reports/ 

National Inquiry into Missing and Murdered Indigenous Women 
and Girls WReclaiming Power and Place: The Final Report 
June 3, 2019 https://www.mmiwg-ffada.ca/final-report/ 


W Canada’s 
1939 to 2000 


45 


Colonization of the Internet 


[75] 


[76] 


77 


78 


79 


[80 


—= 


[81] 


NEWS MEDIA CANADA Levelling the Digital Playing Field 
September 2020 https://www.levellingthedigitalplayingfield.ca 
Natasha Singer: Google Promises Privacy With Virus App 
but Can Still Collect Location Data NY Times, July 20, 
2020, https://www.nytimes.com/2020/07/20/technology/google- 
covid-tracker-app.html 

NSO Group/Q Cyber Technologies Over One Hundred 
New Abuse Cases https://citizenlab.ca/2019/10/nso-q-cyber- 
technologies-100-new-abuse-cases / 


IBM The birth of IBM PC 
https://www.ibm.com/ibm/history/exhibits/ 
pce25/pc25_birth. html 

Nicole  Perlroth: WhatsApp Says Israeli Firm 
Used Its App in Spy Program Oct. 29, 2019, 


hhttps://www.nytimes.com/2019/10/29/technology/whatsapp- 
nso-lawsuit.html 

Philip Inman and Michael Savage Rishi Sunak announces 
‘historic agreement’ by G7 on tax reform The Observer June 5, 
2021 https://www.theguardian.com/world/2021/jun/05/rishi- 
sunak- announces-historic-agreement-by-g7-on-tax-reform 
Raymond B. Blake, John Donaldson Whyte Pierre Trudeau’s 
failures on Indigenous rights tarnish his legacy The COnver- 
sation https: //theconversation.com/pierre-trudeaus-failures-on- 
indigenous-rights-tarnish-his-legacy-162167 

Robert N. Charette Canadian Government’s Phoenix 
Pay System an “Incomprehensible Failure”: That’s 
the nicest thing that could be said for a_ deba- 
cle of the first rank IEEE Spectrum, 05 Jun 2018 
https://spectrum.ieee.org/riskfactor/computing/software/ 
canadian-governments-phoenix-pay-system-an- 
incomprehensible-failure 

Kevin Roose: Social Media Giants Support Racial Justice. 
Their Products Undermine It NY Times, 19, June 2020, 
https://www.nytimes.com/2020/06/19/technology /facebook- 
youtube-twitter-black-lives-matter. html 

CBC Facebook ’sponsored stories’ case will be heard 
by Supreme Court The Canadian Press. Mar 11, 2016 
8 https://www.cbc.ca/news/technology /scc-facebook-sponsored- 
stories-1.2660603 

Sheera Frenkel; Mike Isaac: India and Israel Inflame Face- 
book’s Fights With Its Own Employees NY TImes, June 3, 
2021 https://www.nytimes.com/2021/06/03/technology /india- 
israel-facebook-employees.html 

Sam Levin: Is Facebook a publisher? In public it says 
no, but in court it says yes The Guardian, Jul 3, 2018 
https: //www.theguardian.com/technology/2018/jul/02/facebook- 
mark- zuckerberg-platform-publisher-lawsuit 

Sarah Left: Email timeline https://www.theguardian.com/ 
technology /2002/mar/13/internetnews 


History.com Editors Salt Match Jan 16, 2020 
https: //www.history.com/topics/india/salt-march 

Shira Ovide: Conviction in the Philippines Re- 
veals Facebook’s Dangers NY Times, June 16, 2020, 


https://www.nytimes.com/2020/06/16/technology/ facebook- 
philippines. html 


IDEAS 2021: the 25th anniversary 


[90] 


[91] 


[92] 


[93] 


[94] 


[95 
[96 
[97 


[98 


[99 
[100 
[101 
[102 
[103 
[104 
[105 
[106 
[107 
[108 


[109 


[110 
[111 


[112 


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 


Shira Ovide: China Isn’t the Issue. Big Tech Is NY Times, June 
17, 2021 https://www.nytimes.com/2021/06/17/technology/ 
china-big-tech. html 

Senate Report: The Phoenix Pay Problem: Working To- 
wards a Solution (PDF). Standing Senate Committee on Na- 
tional Finance Report of the Standing Senate Committee 
on National Finance. Ottawa, Ontario. July 31, 2018. p. 34, 
https: //sencanada.ca/content /sen/committee/421/NFFN/ Re- 
ports/NFFN_Phoenix_Report_32_WEB-e.pdf 

Vannevar Bush: As we may think The Atlantic, July 1945 
https://www.theatlantic.com/magazine/archive/1945/07/ as-we- 
may-think/303881/ 


G. Vinton; Robert Cerf; E. Kahn: A Protocol 
for Packet Network  Intercommunication IEEE 
Trans on Comms, Vol Com-22, No 5 May 1974 


https://www.academia.edu/8157148/A_Protocol_for_Packet 
-Network_Intercommunication 

Vega, Nicolas New York Times pulls out of Ap- 
ple News partnershipp NY Post, June 29,. 2020. 
https://nypost.com/2020/06/29/new-york-times-pulls-out- 
of-apple-news-partnership/ 


Wikipedia Videotelephony https://en.wikipedia.org/wiki/ 
Videotelephony 
Wikipedia Austrian Fast India Company 


https: //en.wikipedia.org/wiki/Austrian_East_India_-Company 
Wikipedia Colonization https://en.wikipedia.org/wiki/ Colo- 
nization 

Wikipedia East India Company https://en.wikipedia.org/wiki/ 
East -India_Company 

Wikipedia Dutch Fast India Company 
https: //en.wikipedia.org/wiki/Dutch _East_India-Company 
Wikipedia French East India Company 
https: //en.wikipedia.org/wiki/French _East_India-Company 
Wikipedia Portuguese Fast India Company 
https: //en.wikipedia.org/wiki/Portuguese _East-_India-Company 
Wikipedia Swedish Fast India Company 
https: //en.wikipedia.org/wiki/Swedish _East_India_-Company 


Wikipedia Company rule wn India 
https: //en.wikipedia.org/wiki/Company -_rule_in_India 
Wikipedia History of hypertext 


https: //en.wikipedia.org/wiki/History_of _hypertext 

Wikipedia Hypertext https: //en.wikipedia.org/wiki/Hypertext 

Wikipedia AltaVista https://en.wikipedia.org/wiki/AltaVista 
x 


Wikipedia Window System 
https: //en.wikipedia.org/wiki/X_Window _System 
Wikipedia SoftQuad Software 


https: //en.wikipedia.org/wiki/SoftQuad_Software 

Wikipedia Uyghur genocide https://en.wikipedia.org/wiki/ 
Uyghur-_genocide 

Wikipedia OSI model https://en.wikipedia.org/wiki/OSI_model 


Wikipedia Phoenix pay system 
https: //en.wikipedia.org/wiki/Phoenix _pay_system 
W3C World Wide Web Consortium (W3C) 


https: //www.w3.org 


46 


Data Mining Autosomal Archaeogenetic Data to Determine 
Minoan Origins 


Peter Z. Revesz 
University of Nebraska-Lincoln 
Lincoln, Nebraska, USA 
revesz@cse.unl.edu 


ABSTRACT 


This paper presents a method for data mining archaeogenetic au- 
tosomal data. The method is applied to the widely debated topic 
of the origin of the Bronze Age Minoan culture that existed on the 
island of Crete from 5000 to 3500 years ago. The data is compared 
with some Neolithic and early Bronze Age samples from the nearby 
Cycladic islands, mainland Greece and other Neolithic sites. The 
method shows that a large component of the Minoan autosomal 
genomes has sources from the Neolithic areas of northern Greece 
and the rest of the Balkans and a minor component comes directly 
from Neolithic Anatolia and the Caucasus.

• Computing methodologies → Machine learning; Machine learning
approaches; • Information systems → Data mining; • Ap-


1 INTRODUCTION


The rapid growth of autosomal archaeogenetic data was motivated 
in recent years by a huge interest in tracking ancient human mi- 
grations. Unfortunately, the development of cutting-edge genetic 
sequencing technologies that have facilitated this rapid growth has 
not been matched with a proportional development of advanced 
data mining methods that would bring out the most useful infor- 
mation from the newly available data. The goal of this paper is to 
introduce a new data mining method that can better answer the 
deeper questions about human migrations. 

In this paper, we show how better data mining of archaeogenetic
data can illuminate some widely debated questions about prehistory.
In particular, we use the example of the island of Crete. In
Crete, there was only a minimal presence of humans during the
Paleolithic and the Mesolithic periods, when people lived by
fishing and hunting. Agriculture as well as the keeping of sheep,
goats, cattle and pigs arrived in Crete from the Near East about 8000
years ago. This is well-documented in archaeology. The Bronze Age
started in Crete around 5000 years ago with a civilization that was
named the Minoan civilization by Sir Arthur Evans, the famous
British archaeologist who discovered the palace of Knossos in
central Crete just a few miles south of the modern city of Heraklion,
Greece. Ever since Arthur Evans' discovery in the early 20th
century, people have wondered where the Minoans came from.
Various theories were proposed with no clear answer. Bernal [2] and
Evans [10] proposed an Egyptian, Gordon [12] and Marinatos [16]
a Near Eastern, Campbell-Dunn [3] a Libyan, Haarmann [13]
an Old European, and Gimbutas [11] an Anatolian origin of the
Minoan civilization. Naturally, Crete's position as an island in the
middle of the Mediterranean Sea can lead to many proposals.

One of the motivations for identifying the origins of the Minoan
culture is to find the linguistically closest relatives of the Minoans.
For example, if they came from Libya, then they may have spo- 
ken a Fulani language [3], if they came from the Near East, then 
they may have spoken a Semitic language such as Phoenician or 
Ugaritic [12], and if they came from the north, then they may have 
spoken some Pre-Indo-European language such as Basque, Etruscan 
or a Finno-Ugric language [21]. The Minoan culture left behind 
thousands of inscriptions that are considered undeciphered. The 
identification of related languages can be an important step towards 
the decipherment of the Minoan scripts. 

In this situation, many people hoped that archaeogenetic data
would give a definitive answer to the question of Minoan origins.
However, the archaeogenetic studies were not very conclusive. Moreover,
they apparently contradict each other in many details. The earliest
Minoan DNA study was published in 2013 by Hughey et al. [14],
who examined only mitochondrial DNA (mtDNA). Based only on
the mtDNA data, they made the following claim: "Our data are
compatible with the hypothesis of an autochthonous development of
the Minoan civilization by the descendants of the Neolithic settlers
of the island." That means that they thought that the Neolithic
settlers of Crete stayed in place and eventually developed their Bronze
Age culture. In contrast to Hughey et al. [14], a later mtDNA-based
data mining study by Revesz [23] suggested a Danube Basin and
Western Black Sea origin of the Minoans.

Lazaridis et al. [15] published the first autosomal, whole-DNA
Minoan data in 2017. They wrote the following in
their conclusion: "Minoans and Mycenaeans were genetically similar,
having at least three-quarters of their ancestry from the first
Neolithic farmers of Western Anatolia and the Aegean, and most of
the remainder from ancient populations related to those of the
Caucasus and Iran." This again suggested that the Minoan civilization
was developed from the local Neolithic settlers. Recently, Clemente
et al. [5] succeeded in obtaining more archaeogenetic data from Crete
and other islands of the Aegean Sea between Greece and Turkey.
Their main conclusion was that the Minoans had over seventy-five
percent of ancestry from European Neolithic farmers.

Clearly, the current archaeogenetic results leave a confusing
picture. While the genetics researchers have added much valuable
data to the genome databases in recent years, their analyses seem
confusing and contradictory in some details. This suggests a need
for better data mining of the archaeogenetic data, which we will
develop in this paper.

The rest of this paper is organized as follows. Section 2 describes 
some background to the current study and related previous results. 
Section 3 describes our data mining method. Section 4 presents 
some experimental results. Section 5 gives a discussion of the results,
including a timeline of the Neolithic and Bronze Age migrations
that are implied by our archaeogenetic data mining results. Finally,
Section 6 gives some conclusions and directions for future work.


2 BACKGROUND AND PREVIOUS RESULTS 


In this paper, we describe data mining of ancient genomes with a 
special interest in discovering the origins of two Bronze Age civ- 
ilizations. The first is the Minoan civilization that existed on the 
island of Crete, which is shown at the bottom of Fig. 1, and the 
second is the Cycladic civilization that existed on the other islands
north of Crete and between present-day Greece and Turkey. Currently,
the European Nucleotide Archive (ENA) contains two Cycladic and 
nine Minoan autosomal genomes that were added by Clemente et 
al. [5] and Lazaridis et al. [15]. The Cycladic civilization is repre- 
sented by the samples Kou01 and Kou03 from Koufonisia island, 
while the Minoan civilization is represented by the samples 10070, 
10071, 10073, 10074, 19005 from Hagios Charalambos Cave, Crete, 
samples 19129, 19130, 19131 from Moni Odigitria, Crete, and Pta08 
from Petras, Crete. These autosomal genomes will be compared 
with autosomal genomes from other Neolithic samples. Fig. 1 shows 
the location of some of the Aegean samples. 

Admixture analysis is an important method in genetic testing. 
The basic goal of admixture analysis is to find the possible sources of 
a test sample Test. In an admixture analysis, some possible options 
Ref1, Ref2, ..., Refn are given. Essentially, an admixture analysis 
compares the genes of the Test and the corresponding genes of the 
possible sources. The admixture analysis tallies how many times 
each possible source had a corresponding gene that was closest to 
the gene of Test among all of the sources. For example, the result of
an admixture analysis may be that the Test is composed of 50 percent
of Ref1, 30 percent of Ref2, and 20 percent of Ref3. Sometimes, the
possible source population is not just one genome sample but a 
small set of related samples, for example, all the samples from the 
Minoan site of Moni Odigitria. In this case, several corresponding 
genes can be considered from this group with always the best out 
of those being chosen. In practice, the number of possible source 
populations is limited to a small number, because a full admixture 
analysis is a computationally complex task. 
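
The tallying step described above can be sketched in a few lines of Python. This is only a toy illustration: it assumes that each gene is encoded as a single number and that "closest" means the smallest absolute difference. The function name `tally_admixture` and the data layout are hypothetical and are not taken from any admixture package.

```python
def tally_admixture(test, sources):
    """Percentage of positions at which each source is closest to the test sample.

    test    -- list of numeric gene values (hypothetical encoding)
    sources -- dict mapping a source name to a list of samples,
               where each sample is a list aligned with `test`
    """
    wins = {name: 0 for name in sources}
    for i, test_value in enumerate(test):
        # For a source given as a group, the best member is used.
        def best_distance(name):
            return min(abs(sample[i] - test_value) for sample in sources[name])

        closest = min(sources, key=best_distance)  # ties go to the first source listed
        wins[closest] += 1
    return {name: 100.0 * count / len(test) for name, count in wins.items()}


# Ten positions; the value 9 stands for a gene far from the test sample's 0.
test = [0] * 10
sources = {
    "Ref1": [[0, 0, 0, 0, 0, 9, 9, 9, 9, 9]],
    "Ref2": [[9, 9, 9, 9, 9, 0, 0, 0, 9, 9]],
    # A source population given as a small set of related samples.
    "Ref3": [[9, 9, 9, 9, 9, 9, 9, 9, 0, 9],
             [9, 9, 9, 9, 9, 9, 9, 9, 9, 0]],
}
result = tally_admixture(test, sources)
print(result)
```

With this toy input the tally reproduces the 50, 30 and 20 percent split used as the example in the text.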
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Clemente et al. [5] did an admixture analysis of some Bronze 
Age Aegean samples from the Cycladic, Minoan and Mycenaean 
cultures as shown in Fig. 2. They grouped these samples according 
to the following time periods: Early Bronze Age (EBA), Middle 
Bronze Age (MBA) and the Late Bronze Age (LBA). Fig. 2 shows 
their main results about how their Aegean samples (Test column) 
can be explained by two hypothetical source populations (Ref1 
and Ref2 columns). As can be seen from Fig. 2, their source pop- 
ulations were the Aegean samples listed as Kou01, Kou03, Mik15, 
Log02, Log04, Mik15, Minoan_Lasithi, Minoan_Odigitria, and Myce- 
naean. In addition, they considered the following source popula- 
tions: some of the Aegean samples themselves, Anatolian Neolithic 
(Anatolia_N), Balkans Late Bronze Age (Balkans_LBA), Caucasian 
Hunter-Gatherer (CHG), Europe Late Neolithic and Bronze Age 
(Europe_LNBA), Eastern European Hunter-Gatherer (EHG), Ira- 
nian Neolithic (Iran_N), Pontic Steppe Early and Middle Bronze 
Age (Steppe_EMBA), Pontic Steppe Middle and Late Bronze Age 
(Steppe_MLBA), and Western European Hunter-Gatherer (WHG). 

Clemente et al. [5] apparently did not consider the European 
Early Neolithic and European Middle Neolithic cultures as possible 
sources for the Aegean samples. Lazaridis et al. [15] also did not 
consider those types of samples. Revesz [23] considered European 
Neolithic mitochondrial DNA samples from the Danube Basin and 
suggested that some migration took place from the Danube Basin to 
Crete during the Bronze Age. Our study extends the earlier works
by considering autosomal, whole-DNA data and including European
Early and Middle Neolithic samples too.


3 DATA MINING METHOD 


We have used the G25 admixture analysis package [25]. The G25 
admixture analysis package already represents over a thousand ar- 
chaeogenetic samples or small sets of samples as possible options to 
select for sources. While the previous studies limited their possible 
sources to a few selected ones, we did not preselect any particular 
source and instead considered all available Neolithic autosomal 
data as potential sources. This approach can avoid the different 
conclusions of the previous studies that were due to preselecting a 
very limited number of sources for consideration, while excluding 
hundreds of other possible sources from consideration. 
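
To make the idea of fitting mixture proportions over a set of candidate sources concrete, the following Python sketch searches exhaustively over mixtures of a few sources. It assumes that every sample is represented by a coordinate vector and that a fit is judged by the Euclidean distance between the test sample and the weighted average of the sources. This representation, the function name `fit_mixture`, and the brute-force search are assumptions made for illustration; they do not describe the internals of the G25 package, and an exhaustive search is feasible only for a handful of sources.

```python
import math
from itertools import product


def fit_mixture(target, sources, steps=20):
    """Brute-force search for the mixture of `sources` closest to `target`.

    target  -- coordinate vector of the test sample (hypothetical representation)
    sources -- dict mapping a source name to a coordinate vector
    steps   -- weights are multiples of 1/steps and sum to 1
    Returns (distance, {name: percentage}).
    """
    names = list(sources)
    best = None
    # Enumerate weights for all but the last source; the last takes the remainder.
    for head in product(range(steps + 1), repeat=len(names) - 1):
        rest = steps - sum(head)
        if rest < 0:
            continue
        counts = head + (rest,)
        mix = [
            sum(c / steps * sources[n][d] for c, n in zip(counts, names))
            for d in range(len(target))
        ]
        dist = math.dist(mix, target)
        if best is None or dist < best[0]:
            best = (dist, counts)
    dist, counts = best
    return dist, {n: 100.0 * c / steps for n, c in zip(names, counts)}


# Toy coordinates; the target is constructed as 70% A + 30% B.
sources = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 1.0)}
target = (0.3, 0.0)
distance, weights = fit_mixture(target, sources)
print(distance, weights)
```

The point of the sketch is that source C is offered as a candidate but receives zero weight, which is how an unrestricted candidate list can still yield a short list of sources.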

Our data mining analysis started with the following open-ended
question: Where did the Neolithic ancestors of the Minoans live?
This question was completely open-ended, because we did not
restrict the set of possible sources that the G25 package could choose
from. Our approach was to look as widely as possible for potential
sources among all Old World Neolithic, Mesolithic and Paleolithic
samples from the G25 database, which was a total of 271 potential
sources. Then we concurrently tested all of them as potential
sources for each of the eleven Aegean samples. For each test sample,
the G25 admixture analysis system listed all the hypothetical
sources in decreasing order of percentage. Two examples of the
G25 output results are shown in Fig. 3. On the top we see the
results for the Minoan sample 10070, whose origins are 51.8 percent
Greek Peloponnese Neolithic, 29.6 percent other Greek Neolithic,
11.8 percent Caucasus Lowland from Azerbaijan, 6.6 percent
Caucasian Hunter-Gatherer from Georgia, and 0.2 percent Pre-Pottery
Neolithic Culture B (PPNB) from the Levant. Below this there is
another example regarding the Minoan sample from Petras identified
by Pta08. Clearly, this has a very different origin. In fact, its
hypothetical sources are 49.4 percent the Neolithic Starčevo Culture
from Hungary, 33.6 percent Caucasus Lowland from Azerbaijan,
7.6 percent Linear Pottery Culture (LBK) from Germany, 5.2 percent
Middle Neolithic Alföld Culture from Hungary, 2.6 percent Neolithic
Tepecik-Ciftlik Culture from Turkey, 1.4 percent Pre-Pottery
Neolithic Culture B (PPNB) from the Levant, and 0.2 percent from
Late Neolithic Malaysia.


Figure 1: The location of some of the Minoan samples. On the island of Crete, shown on the bottom of the map, the sample
19127 belongs to Moni Odigitria, and the sample 10070 belongs to the Charalambos Cave. This map was generated based on
the amtDB database [9]


Period Test       Ref1        Ref2              Ref3  Prop. Ref1 ± SE  Prop. Ref2 ± SE  Prop. Ref3 ± SE  p value
EBA    Kou01      Anatolia_N  CHG               -     0.75 ± 0.03      0.25 ± 0.03      -                0.67
EBA    Kou01      Anatolia_N  Iran_N            -     0.75 ± 0.03      0.25 ± 0.03      -                0.90
EBA    Kou03      Anatolia_N  CHG               -     0.69 ± 0.03      0.31 ± 0.03      -                0.10
EBA    Mik15      Anatolia_N  CHG               -     0.84 ± 0.03      0.16 ± 0.03      -                0.08
EBA    Mik15      Anatolia_N  Iran_N            -     0.84 ± 0.03      0.16 ± 0.03      -                0.07
EBA    Pta08      Mik15       Iran_N            -     0.98 ± 0.03      0.02 ± 0.03      -                0.09
EBA    Pta08      Mik15       CHG               -     0.99 ± 0.03      0.01 ± 0.01      -                0.07
EBA    Kou01      Anatolia_N  CHG               EHG   0.74 ± 0.04      0.25 ± 0.03      0.01 ± 0.02      0.67
EBA    Kou01      Anatolia_N  Iran_N            EHG   0.74 ± 0.03      0.24 ± 0.03      0.02 ± 0.02      0.88
EBA    Kou03      Anatolia_N  Iran_N            EHG   0.67 ± 0.03      0.25 ± 0.03      0.08 ± 0.02      0.82
MBA    Log02      Kou01       EHG               -     0.81 ± 0.02      0.19 ± 0.02      -                0.07
MBA    Log02      Kou03       WHG               -     0.91 ± 0.02      0.09 ± 0.02      -                0.06
MBA    Log02      Anatolia_N  Balkans_LBA       CHG   0.22 ± 0.05      0.65 ± 0.06      0.12 ± 0.04      0.08
MBA    Log02*     Kou01       Steppe_MLBA       -     0.61 ± 0.03      0.39 ± 0.03      -                0.20
MBA    Log02*     Kou01       Europe_LNBA       -     0.56 ± 0.04      0.44 ± 0.04      -                0.05
MBA    Log04      Kou01       Balkans_LBA       -     0.21 ± 0.06      0.79 ± 0.06      -                0.08
MBA    Log04      Kou03       Balkans_LBA       -     0.26 ± 0.07      0.74 ± 0.07      -                0.07
MBA    Log04      Mik15       Balkans_LBA       -     0.21 ± 0.06      0.79 ± 0.06      -                0.06
MBA    Log04      Anatolia_N  CHG               EHG   0.58 ± 0.03      0.16 ± 0.03      0.27 ± 0.02      0.12
MBA    Log04*     Anatolia_N  Steppe_EMBA       -     0.53 ± 0.03      0.47 ± 0.03      -                0.35
MBA    Log04*     Anatolia_N  Steppe_MLBA       -     0.38 ± 0.03      0.62 ± 0.03      -                0.13
MBA    Log04*     Pta08       Balkans_LBA       -     0.15 ± 0.04      0.85 ± 0.04      -                0.06
MBA    Log04*     Pta08       Steppe_MLBA       -     0.44 ± 0.03      0.56 ± 0.03      -                0.36
LBA    Mycenaean  Log04       Minoan_Lasithi    -     0.36 ± 0.04      0.64 ± 0.04      -                0.35
LBA    Mycenaean  Log04       Minoan_Odigitria  -     0.21 ± 0.04      0.79 ± 0.04      -                0.45
LBA    Mycenaean  Anatolia_N  Kou03             -     0.37 ± 0.09      0.63 ± 0.09      -                0.40

Figure 2: The admixture analysis results of Clemente et al. [5] about how their Aegean samples (Test column) can be explained
by two hypothetical source populations (Ref1 and Ref2 columns).


4 EXPERIMENTAL RESULTS 


From all the G25 admixture analysis system outputs, we created a 
table as shown in Fig. 4, where each row represents a culture that 
was a source of one of the eleven Cycladic or Minoan samples. We 
only list those rows that had some non-zero value for at least one of 
the eleven Cycladic and Minoan test samples. In each column, the 
percentages add up to 100 percent, meaning that the hypothetical
sources are fully accounted for.
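
The construction of such a table can be sketched as follows, using only the two example outputs quoted in Fig. 3. The dictionary layout and the function name `build_table` are hypothetical; the sketch keeps a row for every source with a non-zero share in at least one sample and then checks that each column sums to 100 percent.

```python
# Per-sample shares copied from the two example outputs in Fig. 3.
outputs = {
    "10070": {
        "GRC_Peloponnese_N": 51.8,
        "GRC_N": 29.6,
        "AZE_Caucasus_lowlands_LN": 11.8,
        "GEO_CHG": 6.6,
        "Levant_PPNB": 0.2,
    },
    "Pta08": {
        "HUN_Starcevo_N": 49.4,
        "AZE_Caucasus_lowlands_LN": 33.6,
        "DEU_LBK_KD": 7.6,
        "HUN_ALPc_Tiszadob_MN": 5.2,
        "TUR_Tepecik_Ciftlik_N": 2.6,
        "Levant_PPNB": 1.4,
        "MYS_LN": 0.2,
    },
}


def build_table(outputs):
    """Rows are sources with a non-zero share in some sample; columns are samples."""
    samples = list(outputs)
    rows = {}
    for sample, shares in outputs.items():
        for source, pct in shares.items():
            if pct > 0:
                # Create the row with zeros for all samples on first sight.
                rows.setdefault(source, {s: 0.0 for s in samples})[sample] = pct
    return samples, rows


samples, table = build_table(outputs)
column_sums = {s: round(sum(row[s] for row in table.values()), 1) for s in samples}
print(len(table), column_sums)
```

Sources shared by both samples, such as Levant_PPNB, end up in a single row with one value per column.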

In Fig. 4 we present the data by grouping the sources together 
according to regions, using a separate color for each region. African 
sources are shown in yellow, Greek and Macedonian sources are 
shown in dark blue, other European sources, which are almost 
all from the Danube Basin, are shown in light blue, Caucasian, 
Ukrainian, and Russian Siberian sources are shown in orange, while 
Fertile Crescent and Iranian sources are shown in green.


4.1 Detection of African Origin


The African DNA sample is from a hunter-gatherer from the Shum 
Laka rock shelter in Cameroon about 8000 years ago. Although the 
connection is only 0.6 percent for the sample 19129, it suggests that 
as the Sahara dried out, some people moved north into Europe, and 
apparently reached the island of Crete, while another group moved 
south to Sub-Saharan Africa. This explains some of the linguistic
connections that were found between African and European mountain
names [21]. This is the first time that any study has detected an
African genetic connection to the Minoans.


Target: GRC_Minoan_Lassithi:l10070
Distance: 2.4160% / 0.02416030
Sources: 269 | Cycles: 269 | Time: 0.175 s

51.8 GRC_Peloponnese_N
29.6 GRC_N
11.8 AZE_Caucasus_lowlands_LN
6.6 GEO_CHG
0.2 Levant_PPNB

Target: GRC_Minoan_EBA:Pta08
Distance: 1.8622% / 0.01862163
Sources: 269 | Cycles: 269 | Time: 0.182 s

49.4 HUN_Starcevo_N
33.6 AZE_Caucasus_lowlands_LN
7.6 DEU_LBK_KD
5.2 HUN_ALPc_Tiszadob_MN
2.6 TUR_Tepecik_Ciftlik_N
1.4 Levant_PPNB
0.2 MYS_LN

Figure 3: Example data output of the G25 admixture analysis system.


4.2 Minoan groups at Charalambos Cave and Moni Odigitria are Distinct

Fig. 5 shows the clusters that we obtain after finding the averages
for each location and then separating the low values between 0 and
34 (sky blue) and high values between 42 and 73 (red). It is now
apparent that the Cycladic Koufonisi and the Minoan Charalambos
Cave samples cluster together, while the Minoan Moni Odigitria and
Petras samples cluster together. Clemente et al. [5] and Lazaridis
et al. [15] did not note any differences between the Charalambos
Cave and the Moni Odigitria groups.
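
The separation into low and high averages described in this subsection can be sketched as follows. The percentages below are hypothetical placeholders, not the values of Fig. 4, and the two source groups are given neutral names for that reason; only the two ranges (0 to 34 and 42 to 73) come from the text. Locations whose averages show the same low/high pattern are placed in the same cluster.

```python
from statistics import mean

# HYPOTHETICAL percentages for two unnamed source groups; the real
# per-sample values are those of Fig. 4 and are not reproduced here.
samples = {
    "Koufonisi": [{"group_A": 60.0, "group_B": 10.0},
                  {"group_A": 66.0, "group_B": 14.0}],
    "Charalambos Cave": [{"group_A": 70.0, "group_B": 5.0},
                         {"group_A": 50.0, "group_B": 15.0}],
    "Moni Odigitria": [{"group_A": 20.0, "group_B": 55.0},
                       {"group_A": 30.0, "group_B": 45.0}],
    "Petras": [{"group_A": 0.0, "group_B": 62.0}],
}


def label(value):
    """Low is 0 to 34 and high is 42 to 73, the two ranges quoted in the text."""
    if 0 <= value <= 34:
        return "low"
    if 42 <= value <= 73:
        return "high"
    return "unclassified"


def cluster(samples):
    """Group locations by the low/high pattern of their per-group averages."""
    groups = {}
    for location, rows in samples.items():
        names = sorted(rows[0])
        pattern = tuple(label(mean(r[n] for r in rows)) for n in names)
        groups.setdefault(pattern, []).append(location)
    return groups


clusters = cluster(samples)
print(clusters)
```

Averaging before thresholding means that a single unusual sample does not by itself move a location into a different cluster.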


4.3 Principal Component Analysis 


Fig. 6 shows a Principal Component Analysis (PCA) of various
Neolithic, Chalcolithic and Bronze Age samples from Anatolia and
Southeastern Europe that are also used in Fig. 4, together with
a few Mycenaean samples that we added for comparison.

The PCA components were automatically generated by a free
PCA software tool developed for the G25 admixture analysis system
and available on its GitHub page. The first and the second principal
components are mapped to the x and the y axes of Fig. 6. Clearly,
the African, Caucasian, and Fertile Crescent samples are all either
greater than -0.1 on the x-axis or less than -0.05 on the y-axis,
except some Turkish Neolithic samples, while the Danube Basin
and Greek samples are the opposite, i.e., less than -0.1 on the x-axis
or greater than -0.05 on the y-axis, except the Corded Ware samples.
Hence, the Corded Ware samples appear to have a mixed Caucasus
and Danube Basin ancestry.
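
The two thresholds quoted above can be turned into a simple rule for reading the PCA plot. The sketch below takes the first condition (x greater than -0.1 or y less than -0.05) as the rule and assigns every other point to the opposite side, which is an assumption about how the two conditions combine. The function name `pca_side` and the example coordinates are hypothetical, and the exceptions noted in the text are not modeled.

```python
def pca_side(x, y, x_cut=-0.1, y_cut=-0.05):
    """Classify a PCA point by the two thresholds quoted in the text.

    Assumption: points with x > x_cut or y < y_cut fall on the African,
    Caucasian and Fertile Crescent side; all remaining points (x at or
    below x_cut and y at or above y_cut) fall on the other side.
    """
    if x > x_cut or y < y_cut:
        return "African/Caucasian/Fertile Crescent side"
    return "Danube Basin/Greek side"


# Hypothetical coordinates, chosen only to exercise each branch.
points = {"p1": (0.05, 0.00), "p2": (-0.15, 0.02), "p3": (-0.15, -0.08)}
sides = {name: pca_side(x, y) for name, (x, y) in points.items()}
print(sides)
```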

Fig. 7 shows a detail of Fig. 6. The detail shows that the (mostly
northern) Greek Neolithic samples are closest to the Hungarian
Körös early Neolithic samples except the separate group of Greek
Peloponnese Neolithic samples. The Charalambos Cave (marked
as GRC_Lassithi_Plateau), the two Cycladic (GRC_Cycladic_EBA)
and the Petras (GRC_Minoan_EBA) samples are increasingly below
and to the right of the Greek Neolithic samples. This suggests a
migration from the Danube Basin to Greece with the Danube Basin
signature diminishing with increasingly heavy local admixture in
the Peloponnese peninsula, the Cyclades and Crete. Some of the
local population may have come from Anatolia as indicated by the
locations of the Anatolian Neolithic samples from Kumtepe and
Tepecik_Ciftlik.

The five Charalambos Cave samples and the three Moni Odigitria 
samples form two different clusters without any overlap. The Moni 
Odigitria samples are all above the Greek Peloponnese Neolithic 
samples and much above the Charalambos Cave, Cycladic and 
Petras samples. The Moni Odigitria samples are close to several 
Danube Basin Middle Neolithic samples. This suggests another 
migration from the Danube Basin to Crete. 

Finally, the Mycenaean samples are considerably to the right of
the Minoan samples, suggesting that they are of mixed Minoan
and Caucasian origin.


5 DISCUSSION OF THE RESULTS 


Based on radiocarbon dating, the estimated age of the Moni
Odigitria samples ranges from 2210 to 1600 BC, while the estimated
age of the Charalambos Cave samples ranges from 2000 to 1700
BC [15]. Therefore, the Moni Odigitria samples seem older. The
Petras sample is even older with an estimated date range from 2849
to 2621 BC [5]. Therefore, it is logical that the Moni Odigitria and
the Petras samples cluster together in Fig. 5. This clustering is
partially supported by Fig. 7, where the Moni Odigitria and the Petras
samples have the same x-axis value. However, they have a significantly
different Caucasus admixture. The Caucasus admixture of the
Petras sample is based on AZE_Caucasus_lowlands_LN, which is
located almost directly below the Petras sample. Hence it is likely
that the Petras sample is lower than the Moni Odigitria samples
due to the higher AZE_Caucasus_lowlands_LN admixture.


Figure 4: The Neolithic sources for various Cycladic and Minoan autosomal DNAs according to the G25 admixture analysis
system.


sequence of events and illustrate them in Fig. 8.


(1) The Neolithic agricultural revolution started in the Fertile
Crescent and spread from there via three routes that are
relevant to our study. The first and the second routes went slowly
through Anatolia, where the spread of agriculture took two
different turns. One route continued along the eastern Black
Sea until the Danube Delta and then followed the Danube
as the gateway into Europe. The early European farmers
who followed this route established the Old European
civilization [11]. This culture is divided into various groups
such as the Vinča culture, the Körös culture, etc. The second
route went along the coast of Southern Europe, reaching from
Greece to the Iberian peninsula. The farmers on this second
route are known for their Cardium pottery culture [11].
A third route of agricultural expansion led toward the
Caucasus. Initially, these three groups kept separate from
each other, but later there was considerable interaction and
apparent convergence of material culture between these
areas. Childe [4] describes archaeological similarities between
the Old European civilization in the Danube Basin and the
Neolithic Dimini culture in Greece. Revesz [22] traces the
spread of art motifs that accompanied the early Neolithic
and the Bronze Age human migrations. Revesz [22] reports
a particularly strong connection between the late Neolithic
and early Bronze Age Danube Basin cultures and the Minoan
culture.

There was also some early population movement from the
Caucasus, which has affected the eastern part of Crete the
most, as evidenced by the high Caucasus lowland ancestry
of the Petras sample as seen in Fig. 4.


Figure 5: The cluster analysis of the four different Aegean locations.

Figure 6: A PCA analysis of the archaeogenetic samples in Figure 4.

Figure 7: PCA detail with the Minoan Charalambos Cave samples, Moni Odigitria samples, and the Greek Peloponnese
Neolithic samples. The dashed circles indicate the clusters of all Minoan Charalambos, all Minoan Moni Odigitria and one group
of Greek Peloponnese Neolithic samples.


(2) At the beginning of the Early Bronze Age, about 5000 years
ago, the Minoan culture was established by people who
moved from mainland Greece and Macedonia and to a lesser
extent from the Caucasus lowland area to the Cyclades and
to Crete as shown in Fig. 8. The EBA migration seems to have
affected the Cyclades, including the island of Koufonisi, and
the northern and central part of Crete, including the
Charalambos Cave on the Lassithi Plateau, according to the cluster
analysis in Section 4.2. The PCA analysis suggests that the
Charalambos Cave samples genetically originated from the
Greek mainland, especially the Peloponnese peninsula, but
they mixed with people who came from the Caucasus.

Concurrently, some EBA migration could have occurred to
eastern Crete from the Danube Basin area, which was part of
the Old European civilization [11]. This may explain the high
concentration of Danube Basin ancestry in the Petras sample.
The migration to Petras could have been easily done by ship
starting from the Danube Delta area and sailing through the
Bosporus Strait and the Dardanelles Strait to the Aegean
Sea and then following the western coastline of Anatolia as
shown in Fig. 8.

Possible Cause: According to the Kurgan hypothesis [11], the
first groups of Indo-Europeans started to move from the Pontic
Steppe to Central and Southeastern Europe around 5000
years ago. Hence, it seems that the Indo-European migration,
and in particular the migration of Mycenaeans to the
Peloponnese area, caused some of the populations to move
south to the Cyclades and Crete.

(3) At the beginning of the Middle Bronze Age, about 4200 years
ago, the Middle Minoan culture was created by another movement
of people to Crete from the Danube Basin area. This
wave of migration also followed the shipping route shown
in Fig. 8, but apparently after reaching the eastern end of
Crete, people sailed along the southern coast of Crete in an
east-to-west direction before reaching Hagia Triada, a natural
harbor near Moni Odigitria. This hypothesis is supported
by the remarkably high Danube Basin admixture and no
Caucasus lowland admixture in the Moni Odigitria samples
as shown by both the cluster analysis in Section 4.2 and the
PCA analysis in Section 4.3.

Figure 8: Some of the Aegean migration routes implied by the archaeological data mining.

In addition, Hagia Triada is the largest source of Minoan
Linear A documents. According to Revesz [20], the underlying
language of Linear A belongs to the Finno-Ugric language
family. Ancient toponyms (place names) suggest that some people
in the Black Sea area spoke Finno-Ugric languages before
the arrival of Indo-Europeans [21]. At the Phaistos palace,
which is also near Moni Odigitria, the Phaistos Disk was
found, which shows another form of Minoan writing that is
related to Cretan Hieroglyphs. These Minoan scripts share
the same underlying language with Linear A [17, 18]. The
AIDA system provides an online, searchable library of Minoan
inscriptions with their translations [24]. Cretan Hieroglyphs,
Linear A and the Phaistos Disk script are members
of the Cretan Script Family, which also includes the Linear B
script that was used by the Mycenaeans [19]. The adaptation of
Linear A to Linear B was accompanied by a language change,
where the signs were reinterpreted in the Mycenaean Greek
language, which resulted in a change of the phonetic values
of many of the signs. Daggumati and Revesz [6] use convolutional
neural networks to extend the script comparison
to other ancient scripts, including the Indus Valley Script,
which also has boustrophedon writing and a large number
of allographs [7].


Possible Cause: This migration may have been prompted
by a major climate change event, known as the 4.2
kiloyear BP aridification event [8], when many areas of the
world, including the Danube Basin, became drier, making
those areas unsuitable for agriculture. Hence, farmers likely
moved from those agricultural areas to the Aegean islands,
which provided good fishing opportunities.


(4) In the Late Bronze Age, around 1450 BC, the Mycenaeans
conquered Crete. They established Knossos as the main center
on Crete and modified the Minoan Linear A writing into
Linear B, which records the earliest form of the Greek language.


Possible Cause: The Santorini volcanic eruption c. 1600 
BC resulted in a large tsunami that may have destroyed 
many of the Minoan coastal towns and the shipping fleet [1]. 
The volcanic eruption likely did not affect the Mycenaeans 
as negatively as the Minoans because while the tsunami 
hit northern Crete, it avoided the Argolis area, where the 
Mycenaeans lived. Hence the Mycenaean civilization could 
recover faster than the Minoan civilization after this catastro- 
phe. As a result, the Mycenaeans became relatively stronger. 
Their surviving ships may have taken over some of the com- 
mercial activities of the Minoans when the latter’s fleet was 
destroyed. 


6 CONCLUSION AND FUTURE WORK 


Most of the farmers in the Danube Basin had ancestry from Ne- 
olithic Anatolia. We are the first to test autosomal DNA for source 
ancestry from both populations simultaneously. The finding that 
the Neolithic Danube Basin is closer to the Moni Odigitria and 
Petras samples than to Neolithic Anatolian samples is surprising 
and contradicts the earlier hypothesis of Lazaridis et al. [15], which 
we quoted in the introduction. It also clarifies the confusion that 
Clemente et al. [5] introduced by grouping these two groups to- 
gether into a "Neolithic European" category. 

Archaeogenetic data mining cannot give a full answer to the 
question of what language was spoken by various groups in South- 
eastern Europe. Our archaeogenetic study suggests that the Early 
Minoan language may be related to some language spoken in the 
Greek Peloponnese peninsula around 5000 years ago, and the Mid- 
dle Minoan language may be related to some language spoken in 
the Danube Basin area about 4200 years ago. The linguistic identifi- 
cation of the underlying language of Minoan Linear A as a Finno- 
Ugric language [20] tends to support our autosomal data mining 
results because part of the Danube Basin and the western Black Sea 
areas likely were Finno-Ugric language areas before the arrival of 
the Indo-Europeans. Some Neolithic Danube Basin archaeological 
sites contain undeciphered inscriptions in a script that resembles 
the Linear A script. That raises the possibility that the Linear A 
script was brought to Crete from the Danube Basin. However, more 
research is needed regarding the relationship between the Danube 
Basin script and the Minoan scripts. 
ABSTRACT


DBMS performance is dependent on many parameters, such as index
selection, cache size, physical layout, and data partitioning. Some
combinations of these parameters can lead to optimal performance
for a given workload, but selecting an optimal or near-optimal
combination is challenging, especially for large databases with complex
workloads. Among the hundreds of parameters, index selection is
arguably the most critical for performance. We propose a
self-administered framework, called the Multiple Type and Attribute
Index Selector (MANTIS), that automatically selects near-optimal
indexes. The framework advances the state of the art in index selection
by considering both multi-attribute and multiple types of indexes
within a bounded storage size constraint, a combination not
previously addressed. MANTIS combines supervised and reinforcement
learning: a Deep Neural Network recommends the type of index for
a given workload, while a Deep Q-Learning network recommends
the multi-attribute aspect. MANTIS is sensitive to storage cost
constraints and incorporates noisy rewards in its reward function for
better performance. Our experimental evaluation shows that MANTIS
outperforms the current state-of-the-art methods by an average of
9.53% QphH@size.
1 INTRODUCTION


The performance of a database application is critical to ensuring
that the application meets the needs of customers. An application’s
performance often depends on the speed at which workloads, i.e.,
sequences of data retrieval and update operations, are evaluated.
A database can be tuned to improve performance, e.g., by creating
an index or increasing the size of the buffer. Figure 1(b) shows the
impact of just two configuration parameters, Memory Size and
Buffer Size, on workload performance. We observe that choosing
an optimal combination can enhance performance significantly.
Currently, a database administrator (DBA) manually tunes
configuration parameters by monitoring performance over time and
adjusting parameters as needed. However, the growing number
of configuration parameters, as shown in Figure 1(a), has increased
the complexity of manually tuning performance.

For many queries, e.g., range and lookup queries, a database
index significantly reduces query time. An index can be created on
a single column or several columns of a database table. There are
also different types of indexes; for instance, Postgres has six index
types: B-tree, Hash, GiST, SP-GiST, GIN and BRIN. One possible
solution is to create all possible indexes for a database. However,
this approach is infeasible due to the large number of potential
indexes, e.g., for a single kind of index (B-tree) on a table with $N$
columns, there are $2^N - 1$ potential (multi-column) indexes. There
are two other reasons why it is crucial to limit the number of
indexes in a database. First, there are space considerations. An index
occupies space in secondary storage that increases the (stored) size
of a database. Second, indexes slow down data modification since
modifications need to update both the data and the index. Creating
too many indexes can decrease the throughput and increase the latency
of a database. Hence the set of indexes created for a database should be
parsimonious: while too few indexes may slow query evaluation, too
many may increase space cost and slow down data modification.

The index recommendation problem can be defined as finding a set
of indexes that minimizes the time taken to evaluate a workload and
the amount of storage used. Finding a set of indexes that minimizes
the cost is a combinatorial optimization problem, and adding a disk
size constraint makes this problem a constrained combinatorial
optimization problem. Finding an optimal solution for such a problem
is NP-hard [22]. With recent advances in Reinforcement
Learning (RL) [13], RL is being used to find approximate solutions
for large-scale combinatorial optimization problems such as
vehicle routing [19], directed acyclic graph discovery [37], and the
traveling salesman problem [16]. RL has also shown that it can
learn complex database tasks with a large search space, such as
learning optimal parameter configuration for a workload [26], and
join order selection [17, 33].

Figure 1: (a) The number of parameters in Postgres (2000-2020). (b) The effect of different parameter settings on workload
performance.

In this paper, we apply RL to maximize query performance by
learning which indexes would be the best to create for a given
workload. There has been previous research on index recommendation
using Reinforcement Learning [1, 2, 9, 14, 24, 31]. However, previous
research has been limited in at least one of three ways. First, previous
research has focused on one type of index, B-Tree, but DBMSs
typically support several types, e.g., Postgres has six types. Second, most
previous research has investigated creating only single-attribute
indexes, but multi-attribute indexes can improve the performance
of many queries. Third, previous research has not considered a
constraint on storage size, that is, the approaches are non-parsimonious
and allow the creation of more indexes than needed (there is no
penalty for creating too many indexes). A generic framework that
captures multi-attribute and different types of indexes is an open
research problem. We propose an end-to-end index recommendation
system that we call the Multiple Type and Attribute Index Selector
(MANTIS). MANTIS can learn to recommend multi-attribute
indexes of different types within a given size constraint.

This paper makes the following contributions. 


• We formulate the Index Recommendation Problem as a
Markovian Decision Process. We design our reward function using
a disk size constraint to limit the total index size.

• We propose an end-to-end multi-attribute and multi-index
type recommendation framework. Our framework, MANTIS,
uses Deep Neural Networks and Deep Q-Learning Networks
for recommendations.

• We perform extensive experiments on MANTIS and compare
results with current state-of-the-art methodologies on two
different datasets.


This paper is organized as follows. The next section presents
related work. Section 3 gives a precise formulation of the index
recommendation problem, while Section 4 describes our solution to the
problem using MANTIS. We present the evaluation of the
performance of MANTIS in Sections 5 and 6. Section 7 presents
conclusions and future work.
2 RELATED WORK


In this section, we discuss related work in the field of index
recommendation using reinforcement learning. We identify the
limitations of previous work and outline the open area of research that this
paper addresses.

A database can be tuned using either external or internal meth- 
ods [7, 11, 21, 30]. An external tuning method uses an API to config- 
ure a DBMS, while an internal method embeds tuning algorithms 
in the DBMS. External tuning is widely preferred because it is 
generally applicable. Internal tuning needs access to DBMS inter- 
nals, which may be proprietary or dependent on a particular DBMS 
software architecture. Most of the approaches discussed here are ex- 
ternal. Internal tuning is primarily industry-based, e.g., the Oracle9i 
optimizer. 

In recent years, Reinforcement Learning (RL) has become a pop- 
ular external tuning method and has been used to optimize join 
order [17, 29, 33], in query optimization [8, 18, 20], for query sched- 
uling [34], to self-tune databases [12, 35, 36] and to improve data 
partitioning [3, 5, 32]. For the index selection problem, Basu et al. [1] 
proposed a tuning strategy using Reinforcement Learning. They 
formulate the index selection problem as a Markovian Decision 
Process and use a state-space reduction technique to scale their 
algorithm for larger databases and workloads. Sharma et al. [26] 
proposed NoDBA for index selection. NoDBA stacks the workload 
and potential indexes as input to the neural network and uses Deep 
Reinforcement Learning with a custom reward function for index 
recommendation. Both of the above approaches consider only sin- 
gle attribute indexes and a single kind of index. They are unable to 
recommend multi-attribute indexes. Welborn et al. [31] introduced
a latent space representation for workload and action spaces. This
representation enables them to perform other tasks, e.g., workload
summarization and analyzing query similarity. They used a variant
of DQN with dueling called BDQN (Branched Deep Q-Network)
for learning index recommendation. Licks et al. [14, 15] introduced
SmartIX, which uses Q-Learning for index recommendation.
In their approach, they learn to build indexes over multiple
tables in a database. They also evaluate SmartIX using the standard
metrics QphH@size, Power@size and Throughput@size. They use
QphH@size in their reward function, which makes the evaluation
process slow; in some cases, it may take several days of
computation, which impacts the scalability of their approach. Kllapi et
al. [6] propose a linear programming-based index recommendation
algorithm. Their approach identifies and builds indexes during
idle CPU time to maximize CPU utilization without affecting
performance. Lan et al. [9] propose an index advisory approach
using heuristic rules and Deep Reinforcement Learning. Sadri et
al. [24] utilize Deep Reinforcement Learning to select indexes for
cluster databases. The previous work [9] and [24] evaluated the
performance of a selected index by observing the reduction in query
execution cost. We believe that evaluating indexes based on
execution cost is not ideal because it only measures the increase in read
speed. The recommended indexes for modern databases must also
be evaluated for durability (Power@size) and processing capability
(Throughput@size). The above approaches have primarily focused
on recommending B-tree indexes. By focusing on a single type of
index, such approaches lack the capability of utilizing other types
of indexes like BRIN or Hash to improve query performance. There
are no previous approaches for performing an end-to-end index
recommendation for both multi-attribute and multi-type index
selection, which is the focus of this paper. Moreover, most previous
approaches do not support a storage space constraint.


3 PROBLEM FORMULATION 


The index recommendation problem is to select a set of indexes 
that minimizes the time to evaluate a workload and the amount of 
storage needed. 

A workload $W$ is a set of SQL queries $Q_1, Q_2, \ldots, Q_m$. An index
configuration $I$ is a set of indexes. We calculate the cost of workload
evaluation on database $D$ using the cost of evaluating each query,
given by $Cost(Q_j, I, D)$. The cost of a workload can be described as
follows.

$$Cost(W, I, D) = \sum_{j=1}^{m} Cost(Q_j, I, D)$$

Note that the workload cost does not weigh queries differently,
though we could trivially include such weights by replicating
individual queries in a workload. The index selection problem is to find
a set of indexes $I_{optimal}$ that minimizes the total cost of the workload
$Cost(W, I, D)$ and has a storage cost of at most $C$.

$$I_{optimal} = \arg\min_{I^* \,:\, S(I^*) \le C} Cost(W, I^*, D)$$

In this equation, $S(I^*)$ is the total storage space cost of the set of
indexes $I^*$.
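
To make the formulation concrete, the following minimal Python sketch enumerates every index configuration for a toy workload and keeps the cheapest one whose storage cost stays within the bound $C$. The candidate index names, their sizes, and the per-query costs are hypothetical values invented for illustration; in practice the costs would come from the DBMS. Exhaustive search like this is only feasible for very small inputs, which is what motivates a learned approach.

```python
from itertools import combinations

# Hypothetical candidate indexes with storage sizes in MB (illustrative values).
INDEX_SIZE = {"idx_tax": 4, "idx_orderkey": 8, "idx_partkey": 6}

# Hypothetical per-query costs: cost without indexes and cost with a helpful index.
QUERIES = {
    "Q1": {"base": 100.0, "with": {"idx_tax": 20.0}},
    "Q2": {"base": 80.0, "with": {"idx_orderkey": 10.0, "idx_tax": 50.0}},
    "Q3": {"base": 60.0, "with": {"idx_partkey": 15.0}},
}


def query_cost(query, config):
    """Cost(Q_j, I, D): cheapest plan available under index configuration I."""
    info = QUERIES[query]
    usable = [cost for idx, cost in info["with"].items() if idx in config]
    return min(usable + [info["base"]])


def workload_cost(config):
    """Cost(W, I, D): sum of Cost(Q_j, I, D) over all queries in the workload."""
    return sum(query_cost(q, config) for q in QUERIES)


def storage(config):
    """S(I): total storage used by the configuration."""
    return sum(INDEX_SIZE[idx] for idx in config)


def best_configuration(budget):
    """Exhaustively search all configurations I with S(I) <= budget."""
    candidates = list(INDEX_SIZE)
    best, best_cost = frozenset(), workload_cost(frozenset())
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            config = frozenset(subset)
            cost = workload_cost(config)
            if storage(config) <= budget and cost < best_cost:
                best, best_cost = config, cost
    return best, best_cost
```

With a budget of 12 MB, the sketch picks the two indexes that fit and together give the lowest workload cost, even though a different pair would be cheaper if the budget were larger.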


4 MANTIS FRAMEWORK 


We designed and built our framework using Deep Neural Networks
(DNN) and Deep Q-Learning Networks (DQN). Our first research
goal is index type selection: choosing a suitable set of index
types for a workload. The second research goal is index
recommendation: picking the possible (single/multiple) attributes for
the index. Our framework, research goals, and the challenges we
overcome are described in this section.
Figure 2: DNN training and testing performance


4.1 Index Type Selection 


Almost all DBMSs have several types of indexes. Though the types 
vary across DBMSs, common index types include B-tree index 
(which includes all B-tree variants and is often the default index 
type), block range index (BRIN), which is helpful in range queries, 
and hash index, which improves the performance of hash joins 
and point queries. Spatial indexes, such as R-tree indexes, are also 
commonly available. We trained a Deep Neural Network (DNN) to
choose the best types of indexes for a given workload. The DNN
model takes a workload as input and predicts potential index types.

DNN model: We convert SQL queries in a workload to a vector
representation using feature extraction. Specifically, we extract
features describing different query types for B-Tree, BRIN, Hash,
and Spatial indexes.

• feature_1: Describes the count of each operator used in a
query. The operators we search for are [>, <, =, ≤, ≥]. This
feature helps us identify whether a query searches for equality,
a range, or is general. In the case of equality, a Hash index would
be preferable, and for queries based on [<, >] a BRIN index would be
preferred over a B-Tree.
• feature_2: The number of columns mentioned in a query.
• feature_3: The number of conjunctions/disjunctions
[‘and’, ‘or’] in a query.
• feature_4: To identify spatial queries, we extract certain
keywords from the query [‘.geom’, ‘.location’, ‘distance’].
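
The four features above can be sketched as a small extractor. This is a minimal illustration, not the implementation used in the experiments: the function name `extract_features`, the regular expressions, the feature ordering, and the choice to count only columns that appear on the left of a comparison in the WHERE clause are all assumptions made for the example. The operators ≤ and ≥ are matched in their SQL spellings `<=` and `>=`.

```python
import re
from collections import Counter

# Operator order used for feature_1 (SQL spellings of >, <, =, <=, >=).
OPERATORS = (">", "<", "=", "<=", ">=")
SPATIAL_KEYWORDS = (".geom", ".location", "distance")

# Two-character operators are listed first so "<=" is not read as "<" then "=".
_OP_RE = re.compile(r"<=|>=|<|>|=")
# Assumption: a column is an identifier directly followed by a comparison operator.
_COL_RE = re.compile(r"([A-Za-z_][\w.]*)\s*(?:<=|>=|<|>|=)")
_BOOL_RE = re.compile(r"\b(?:and|or)\b", re.IGNORECASE)


def extract_features(query):
    """Return [op counts (5), n_columns, n_connectives, n_spatial_keywords]."""
    parts = re.split(r"\bwhere\b", query, maxsplit=1, flags=re.IGNORECASE)
    clause = parts[1] if len(parts) > 1 else ""

    op_counts = Counter(_OP_RE.findall(clause))
    feature_1 = [op_counts[op] for op in OPERATORS]
    feature_2 = len({name.lower() for name in _COL_RE.findall(clause)})
    feature_3 = len(_BOOL_RE.findall(clause))
    lowered = query.lower()
    feature_4 = sum(1 for keyword in SPATIAL_KEYWORDS if keyword in lowered)
    return feature_1 + [feature_2, feature_3, feature_4]
```

For a query with three range predicates joined by two `AND`s, the resulting vector records two `>` operators, one `<` operator, three columns, two connectives, and no spatial keywords.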


Using the above features, we pre-train our DNN model. We
use a fully connected multi-layer network with three layers. The
first and second layers consist of 32 neurons each with ReLU activation,
and the output layer consists of num_classes neurons (in our case 4) with
sigmoid activation. The mean squared error (MSE) is used as the cost
function; the model is trained for 30 epochs, with 30% of the data held
out for validation. The model is trained using the Adam optimizer with a
learning rate of 0.001, and we use a learning rate decay (initial rate/epochs)
for the stability of the network. We use data generated by our
TPC-H random query generator to pre-train. The training dataset
comprises 500 queries for each of the four index types, for a
total of 2000 queries. The process of generating queries is explained
in Section 5. We observe the loss and accuracy of our DNN
model to be stable; Figure 2 displays the DNN model performance.
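
The layer structure described above (two hidden layers of 32 ReLU units and a sigmoid output layer with four units, scored with MSE) can be illustrated with a dependency-free forward pass. This is only a sketch of the architecture: the weights are random and untrained, the input width of eight features is an assumption, and training with Adam and learning rate decay is deliberately left out.

```python
import math
import random


def relu(values):
    return [max(0.0, x) for x in values]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-x)) for x in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights; trained weights would replace these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_features, num_classes=4, hidden=32, seed=0):
    """Three layers: hidden(32, ReLU), hidden(32, ReLU), output(num_classes, sigmoid)."""
    rng = random.Random(seed)
    return [init_layer(n_features, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, num_classes, rng)]


def predict(network, features):
    """Forward pass returning one score in (0, 1) per index type."""
    (w1, b1), (w2, b2), (w3, b3) = network
    h1 = relu(dense(features, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return sigmoid(dense(h2, w3, b3))


def mse(predicted, target):
    """Mean squared error, the cost function named in the text."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

Because the output layer uses independent sigmoid units rather than a softmax, several index types can receive a high score for the same workload.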


4.2 Index Recommendation 


Index recommendation in MANTIS uses Deep Neural Networks
for function approximation with the Q-Learning algorithm, also
known as the Deep Q-Learning Network (DQN). The indexes are
interdependent, and this dependency can benefit performance. Such
interdependency can be utilized for selecting a set of optimal
indexes. Using the notion of interdependence, we formulate the index
recommendation problem as a Markovian decision process (MDP).
We extract relevant information from a database to define a state,
an action, and a $state_t$ to $action_t$ mapping at time $t$. We represent
a state using the existing indexes in a system and an action using the set
of all possible indexes. Both the state and action are deterministic.
In a conventional MDP, every action is mapped to a state and
an associated reward. The goal of an MDP is to reach the final state
while maximizing cumulative rewards, and to identify a policy, which
is a state-action mapping that selects the appropriate action at a
given state. The two fundamental methods to solve MDPs are value
iteration and policy iteration. Value iteration uses the value of a state,
which quantifies the amount of future reward it can generate using the
current policy, also known as the expected return. Policy iteration uses
the policy of a state-action pair that signifies the amount of current
reward it can generate with an action at a state using a specific
policy.

Figure 3: MANTIS framework using index type selection and index recommendation.

MANTIS uses a variant of value iteration called Q-Learning. 
Q-Learning is a value-based, temporal-difference reinforcement 
learning algorithm. In a Q-Learning algorithm, the agent learns 
from the history of environment interactions. Q-Learning uses an 
average of old and newer observations to update neural networks. 
It reduces temporal dependence due to random data sampling for 
training, leading to faster convergence. Our framework can also 
learn in a constrained environment, in our case, a storage size 
bound. The constraint is generic and could be used in conjunction 
with other variables, e.g., buffer size, memory size. 
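
The "average of old and newer observations" can be seen in the standard tabular Q-Learning update, sketched below. This is a simplified illustration that stores Q-values in a dictionary; MANTIS approximates them with a neural network instead. The function name `q_update` and the state and action labels are hypothetical, and the default step size and discount factor are taken from the hyperparameter values reported later in this section.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.001, beta=0.97):
    """One temporal-difference update of Q(state, action).

    The new value is a weighted average of the old estimate and the
    bootstrapped target reward + beta * max_a' Q(next_state, a').
    """
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + beta * best_next
    old = q_table[(state, action)]
    q_table[(state, action)] = (1 - alpha) * old + alpha * target
    return q_table[(state, action)]


# Usage: unseen state-action pairs start at 0.0.
q = defaultdict(float)
q[("s1", "create_idx_a")] = 2.0
value = q_update(q, "s0", "create_idx_a", reward=1.0, next_state="s1",
                 next_actions=["create_idx_a", "create_idx_b"],
                 alpha=0.5, beta=0.5)
```

With `alpha=0.5` the update moves the estimate halfway from its old value toward the target, which makes the averaging behaviour easy to check by hand.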

State and Action: An agent interacts with a state for the learning
process. A state is the representation of the environment; it
contains the information available to the learning agent and
should be capable of accurately describing the
environment. We use a potential set of indexes as the state
representation. The action space represents all possible actions that can
be performed. We calculate the number of possible actions $N_{actions}$
using the following equation:

$$N_{actions} = (2^{N_{columns}} - 1) \times len(index\_type)$$

where index_type is the set of possible index types predicted by our
DNN model and $(2^{N_{columns}} - 1)$ is the number of all possible
combinations of columns to index in a table.
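
As a quick check of the action count, the sketch below enumerates every non-empty column subset for each predicted index type. The function name and the example column and type names are hypothetical. Column subsets are treated as unordered, matching the formula; a system that distinguishes column order within a multi-column index would have a larger action space.

```python
from itertools import combinations


def enumerate_actions(columns, index_types):
    """List every (index_type, column_subset) pair, excluding the empty subset."""
    actions = []
    for k in range(1, len(columns) + 1):
        for subset in combinations(columns, k):
            for index_type in index_types:
                actions.append((index_type, subset))
    return actions


# Three columns and two index types give (2**3 - 1) * 2 = 14 actions.
actions = enumerate_actions(["l_tax", "l_partkey", "l_suppkey"], ["btree", "hash"])
```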

Reward Function: A reward function is another vital compo- 
nent of a reinforcement learning agent. We design our reward 
function based on workload cost estimation. We compare the cost 
of the index set against the set of all possible indexes as defined 
below: 


$$reward_{size} = \begin{cases} 1, & index\_size < max\_allowed\_size \\ -1, & \text{otherwise} \end{cases}$$

$$r_t = \max\left(\frac{no\_index\_cost}{index\_cost} - 1,\; 0\right) + reward_{size} \qquad (1)$$


where, the denominator is the workload cost with a selected set of 
indexes, and the numerator is workload cost with no indexes. We 
also use a reward for the storage size constraint. 
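
A minimal sketch of this reward, following the description above: the gain term is the ratio of the no-index workload cost to the cost with the selected indexes, minus one and clipped at zero, and the size term adds 1 when the indexes fit within the storage bound and -1 otherwise. The function and parameter names are chosen for the example.

```python
def size_reward(index_size, max_allowed_size):
    """+1 if the selected indexes fit within the storage bound, else -1."""
    return 1.0 if index_size < max_allowed_size else -1.0


def reward(no_index_cost, index_cost, index_size, max_allowed_size):
    """r_t = max(no_index_cost / index_cost - 1, 0) + size reward."""
    gain = max(no_index_cost / index_cost - 1.0, 0.0)
    return gain + size_reward(index_size, max_allowed_size)
```

For example, halving the workload cost while staying within the bound yields a reward of 2.0, while a configuration that makes the workload slower and exceeds the bound yields -1.0.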

Noisy Rewards: Inconsistent rewards can cause an agent’s
performance to degrade, and a learning agent will not maximize
rewards. Though we use the query cost estimate as a reward, previous
research has shown the inefficacy of DBMS cost estimators [10].
To minimize the noise in the reward function, we apply a 1D
convolution filter with a kernel size of five to the rewards.
Given an input vector $f$ of size $k$ and a convolution
kernel $g$ of size $m$, a 1D convolution filter can be formulated as
follows:

$$(f * g)(j) = \sum_{i=1}^{m} \big( g(i) \cdot f(j - i + m/2) \big) / m$$

We experimented with several well-known noise-reducing filters,
namely the exponential filter, the Savitzky-Golay filter, and the Kalman filter.
We found that these filters either overestimated or underestimated rewards.
The 1D convolution filter also tends to underestimate the cost.
However, due to its fast computation time and streaming nature,
it proved to be a feasible solution. We also added a spike filter to
suppress extreme values.
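
The smoothing step can be sketched as a centred sliding-window filter. The sketch assumes a uniform kernel of ones (so the filter reduces to a moving average of width five) and clamps indices at the borders of the reward sequence; both choices are assumptions for illustration, as is the function name. The spike filter is not shown.

```python
def smooth_rewards(rewards, kernel=None, m=5):
    """Apply a centred 1D convolution of width m and divide by m.

    Assumptions for this sketch: the kernel defaults to all ones, and
    positions outside the sequence reuse the nearest border value.
    """
    if kernel is None:
        kernel = [1.0] * m
    m = len(kernel)
    half = m // 2
    last = len(rewards) - 1
    smoothed = []
    for j in range(len(rewards)):
        total = 0.0
        for i in range(m):
            # Centre the window on position j and clamp at the borders.
            src = min(max(j - i + half, 0), last)
            total += kernel[i] * rewards[src]
        smoothed.append(total / m)
    return smoothed
```

A single spike of 10 in a sequence of zeros is spread over its five neighbouring positions as 2.0 each, which shows how the filter damps an isolated noisy reward.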

Algorithm 1 DQN with Priority Experience Replay ▷ (with hyperparameters initial values)
Initialize batch size $N_s$ ▷ 32
Initialize number of iterations $I$ ▷ 100
Initialize length of an episode $L$ ▷ 3-6
Initialize update target network frequency $F$
Initialize priority scale $\eta$ ▷ 1.0
Initialize priority constant $\epsilon$ ▷ 0.1
Initialize network parameter $\theta$
Initialize target network parameter $\theta'$
Initialize learning rate $\alpha$ ▷ 0.001
Initialize discount factor $\beta$ ▷ 0.97
for $l \in L$ do ▷ number of episodes
    Collect experiences $(s, a, r, s', p)$ ▷ until minimum $N_s$ size
    for $i \in I$ do
        Sample $N_s$ prioritized samples from buffer
        for $n \in N_s$ do
            $y_i = r_i + \beta \max_{a'_i \in A} Q_{\theta'}(s'_i, a'_i)$
            if done == True then ▷ check for terminal state
                $\delta_i = |y_i - Q_\theta(s_i, a_i)|$ ▷ calculate TD Error
            end if
        end for
        $L(\theta) = \frac{1}{N_s} \sum_i (y_i - Q_\theta(s_i, a_i))^2$ ▷ calculate loss MSE
        $\theta = \theta - \alpha \nabla_\theta L(\theta)$ ▷ update network parameters
        $p_i = \frac{(|\delta_i| + \epsilon)^{\eta}}{\sum_k (|\delta_k| + \epsilon)^{\eta}}$ ▷ calculate and update samples priority
    end for
    if $l \bmod F$ then
        $\theta' = \theta$ ▷ update target network
    end if
end for

DQN with Priority Experience Replay: Priority Experience
Replay (PER) has proven to be a very effective improvement
over the traditional sampling strategy in a DQN algorithm [25]. Rather
than uniform sampling, PER weights the samples such that samples
producing a high error are drawn more frequently in training.
PER helps reduce the overall bias and improves the performance
of the network. We compute PER priorities based on the Temporal
Difference (TD) error. The TD error ($\delta$) is computed using the equation
below:

$$\delta_i = |y_i - Q_\theta(s_i, a_i)|$$

where $y_i$ is the target and $Q_\theta(s_i, a_i)$ is the estimate. The target ($y_i$)
is computed using Bellman’s optimality equation [27], as shown
below.

$$y_i = r_i + \beta \max_{a'_i \in A} Q_{\theta'}(s'_i, a'_i)$$

where $r_i$ is the computed reward, $\max_{a'_i \in A} Q_{\theta'}(s'_i, a'_i)$ is the future
reward, and $\beta$ is the discount factor. The priority ($p_i$) of each sample
is computed from the TD error by:

$$p_i = \frac{(|\delta_i| + \epsilon)^{\eta}}{\sum_{k} (|\delta_k| + \epsilon)^{\eta}}$$

where $\epsilon$ is the priority constant and $\eta$ is the priority scale.
To learn the weights of the neural network parameters in the DQN, we
use stochastic gradient descent on the MSE loss function $L(\theta)$.
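
The target, TD error, and priority computations can be sketched in a few lines. This is an illustrative sketch rather than the training code: the function names are hypothetical, the Q-values are passed in as plain numbers instead of coming from a network, dropping the bootstrap term at terminal states follows the usual DQN convention (an assumption here), and sampling is done with replacement for simplicity.

```python
import random


def td_target(reward, next_q_values, beta=0.97, done=False):
    """y = r + beta * max_a' Q(s', a'); terminal states keep only the reward."""
    if done or not next_q_values:
        return reward
    return reward + beta * max(next_q_values)


def td_error(target, estimate):
    """delta = |y - Q(s, a)|."""
    return abs(target - estimate)


def priorities(td_errors, eta=1.0, eps=0.1):
    """p_i = (|delta_i| + eps)^eta / sum_k (|delta_k| + eps)^eta."""
    scaled = [(abs(delta) + eps) ** eta for delta in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(probabilities, batch_size, rng=random):
    """Draw experience indices in proportion to their priority."""
    return rng.choices(range(len(probabilities)), weights=probabilities,
                       k=batch_size)
```

The constant `eps` keeps every experience sampleable even when its TD error is zero, and `eta` controls how strongly sampling favours high-error experiences.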

DQN Agent training: The training procedure consists of
$N_{episodes}$ episodes, and each episode has $N_{index}$ steps, where $N_{index}$
represents the maximum number of indexes. During an episode,
the agent performs an action based on a policy and collects a reward
for the selected action; this is known as an experience. These experiences
are stored in a buffer, called the Replay Buffer, for (priority-based)
sampling and training. The cost of an action is calculated by its effect
on the total workload cost of retrieval using the reward function
from Equation 1. The computed rewards are adjusted using the 1D
convolution filter. At the end of each step, the state, action, and rewards
are returned. During training, we use a configuration with a batch size of
32, 100 episodes, a maximum index number $N_{index}$ of
3-6, a priority scale $\eta$ of 0.97, a learning rate $\alpha$ of 0.001, a discount
factor $\beta$ of 0.97, and a storage bound of 10-20MB. Our complete
algorithm with hyperparameter values is shown in Algorithm 1,
and the framework is depicted in Figure 3.


5 EXPERIMENTAL SETUP 


We perform experiments on two datasets: a standard database
benchmark, TPC-H [28], and a real-world dataset, IMDB [10]. We use
PostgreSQL as our database. We create an OpenAI Gym
environment for command-based database interaction. All
experiments were performed on a computer with an Intel i7 5820k, an Nvidia
1080ti, and 32GB of RAM running Ubuntu 18.04. We use Python 3.7
and libraries (powa, gym, psycopg2, sklearn, TensorFlow, Keras) to
write and train our framework. The DNN and DQN were trained
on the Nvidia 1080ti (3584 CUDA cores and 11GB GDDR5X RAM) with
CUDA and cuDNN configured for performance enhancement.


(1) TPC-H: TPC is the most well-known and most widely used
family of database benchmarks. TPC-H is the benchmark for
decision support systems. There exists a set of 22 TPC-H query
templates, which are mainly used for benchmarking
database systems. However, they are not specific to index
selection benchmarking. With this in mind, we generate a dataset
of 120k tuples using TPC-H and randomly generate queries as
follows: we randomly select columns and a value from a table. We then
randomly select an operator [>, <, =] and a logical connective [and, or]. We
create four different sets of queries by randomly selecting between
one and four columns (1C, 2C, 3C, 4C). The variety of queries
assists in the validation of our framework’s efficacy. We generate 100
queries for each TPC-H experiment. We use only 120k rows in this
experiment; in future work, we plan to focus on larger-scale
index selection. Our purpose in generating random queries from
TPC-H is to simulate different types of environments and observe
the performance of MANTIS compared to the baseline. This experiment
simulates single-column (1C) and multi-column (2C, 3C, and 4C)
indexes.

A few of the randomly generated queries used in our experiments:


1C: SELECT COUNT(*) FROM LINEITEM WHERE L_TAX < 0.02 

2C: SELECT COUNT(«) FROM LINEITEM WHERE L_ORDERKEY < 
11517219 OR L_TAX < 0.02 

3C: SELECT COUNT(+) FROM LINEITEM WHERE L_SUPPKEY < 18015 
AND L_PARTKEY > 114249 AND L_TAX > 0.06 

4C: SELECT COUNT(+) FROM LINEITEM WHERE L_ORDERKEY = 8782339 
AND L_TAX = 0.01 AND L_PARTKEY = 264524 AND L_SUPPKEY > 
14028 
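
The query generation procedure described for TPC-H can be sketched as follows. This is only an illustration under stated assumptions: the column list, the value ranges, and the choice of a single connective per query are hypothetical and are not taken from the paper, and the function names are ours.

```python
import random

# Hypothetical column metadata: (name, low, high, is_integer).
# The value ranges are illustrative assumptions, not the paper's settings.
LINEITEM_COLUMNS = [
    ("L_ORDERKEY", 1, 12_000_000, True),
    ("L_PARTKEY", 1, 400_000, True),
    ("L_SUPPKEY", 1, 20_000, True),
    ("L_TAX", 0.0, 0.08, False),
]
OPERATORS = ["<", ">", "="]
CONNECTIVES = ["AND", "OR"]


def random_query(num_columns, rng):
    """Build one COUNT(*) query with `num_columns` random predicates."""
    columns = rng.sample(LINEITEM_COLUMNS, num_columns)
    connective = rng.choice(CONNECTIVES)  # one connective per query (assumption)
    predicates = []
    for name, low, high, is_integer in columns:
        if is_integer:
            value = rng.randint(low, high)
        else:
            value = round(rng.uniform(low, high), 2)
        predicates.append(f"{name} {rng.choice(OPERATORS)} {value}")
    where = f" {connective} ".join(predicates)
    return f"SELECT COUNT(*) FROM LINEITEM WHERE {where}"


def generate_workload(num_columns, num_queries=100, seed=0):
    """Generate a reproducible workload for one scenario (1C to 4C)."""
    rng = random.Random(seed)
    return [random_query(num_columns, rng) for _ in range(num_queries)]
```

Calling `generate_workload(3)` would give a 3C workload of 100 queries; fixing the seed keeps the workload identical across the compared methods.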


(2) IMDB: IMDB is an extensive database of movie-related data. 
There are 21 tables, with a few large tables such as cast_info table, 
which has 36 million records, and the movie_info table, which has 
15 million records. It has 113 computationally intensive SQL queries 
with multi-joins. We randomly and equally divide the queries into 
three stages with 37 (Stage 1), 38 (Stage 2), and 38 (Stage 3) queries. 
Figure 4: Power, Throughput and QphH of index recommendation methods and their comparison in different scenarios on the TPC-H dataset.


The aim of separating the queries is to create more realistic
scenarios for framework validation. We evaluate our framework on all
three stages by selecting indexes and comparing performance with
the baselines.


5.1 Performance Metric 


We measure performance using standard DBMS metrics, namely
Power@Size, Throughput@Size, and QphH@Size.

Power@Size tests how well the chosen indexes hold up under the
insertion and deletion of records in the database. It includes
several steps: (1) a refresh function RF1 that inserts
0.1% of the table's data, (2) the execution of a single stream of
queries, and (3) a refresh function RF2 that deletes 0.1% of the
table's records at random. The metric is calculated using the
following equation:

$$\text{Power@Size} = \frac{3600}{\sqrt[N_q]{\prod_{i=1}^{N_q} ET(i) \times \prod_{j=1}^{2} RF(j)}} \times \text{Scale Factor}$$

where $N_q$ is the number of queries, $ET(i)$ is the execution time of
query $i$, $RF(j)$ is the time taken by refresh function $j$ (RF1 and RF2),
and Scale Factor is the factor of database size used from TPC-H
and IMDB.

Throughput@Size measures the processing capability of a system
(disk I/O, CPU speed, memory bandwidth, bus speed, etc.). It
is computed using the equation below:

$$\text{Throughput@Size} = \frac{\text{Query Stream} \times N_q}{\text{Total Time}} \times 3600 \times \text{Scale Factor}$$

where Query Stream is the number of query streams (for our
experiments we used only a single stream) and Total Time is the time
taken to execute all queries for all streams.

Figure 5: Power, Throughput and QphH of index recommendation methods and their comparison in different scenarios on the IMDB dataset.

QphH@Size measures multiple aspects of database performance.
It measures the query processing power and throughput when
queries are from multiple streams. The Query-per-Hour Performance
Metric (QphH) is calculated using Power@Size and Throughput@Size,
as shown below:

$$\text{QphH@Size} = \sqrt{\text{Power@Size} \times \text{Throughput@Size}}$$

Experimental Design: In the TPC-H 1C scenario, with only
single-attribute index selection, there are 16 states (the number of
columns in LINEITEM). For multi-attribute index selection, we randomly
select four columns; for 2-, 3-, and 4-column indexes, there are 15,
80, and 255 states, respectively. For the IMDB dataset, we use all
tables and columns in index selection. We use only single-column
indexes, and the state space consists of 108 states. The number of
states is crucial for initializing the action and state spaces of the
RL agent. The baselines are evaluated on the database with identical
records and workload.
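
As a rough illustration, the three metrics of Section 5.1 can be computed from measured timings as sketched below. The function names and the example timings are hypothetical; the sketch assumes the geometric mean in Power@Size is taken with root index $N_q$, and it computes that mean in log space to avoid overflow on long workloads.

```python
import math


def power_at_size(query_times, refresh_times, scale_factor):
    """3600 * SF divided by the N_q-th root of the product of all timings.

    query_times:   ET(i), execution time in seconds of each query
    refresh_times: RF(j), time in seconds of the two refresh functions
    """
    n_q = len(query_times)
    timings = list(query_times) + list(refresh_times)
    log_product = sum(math.log(t) for t in timings)
    return 3600.0 / math.exp(log_product / n_q) * scale_factor


def throughput_at_size(num_streams, n_q, total_time, scale_factor):
    """Queries processed per hour across all streams, scaled by SF."""
    return num_streams * n_q / total_time * 3600.0 * scale_factor


def qphh_at_size(power, throughput):
    """Geometric mean of Power@Size and Throughput@Size."""
    return math.sqrt(power * throughput)


# Hypothetical timings (seconds) for a two-query workload, single stream.
power = power_at_size([4.0, 1.0], [1.0, 1.0], scale_factor=1.0)
throughput = throughput_at_size(1, 2, total_time=5.0, scale_factor=1.0)
qphh = qphh_at_size(power, throughput)
```

Because QphH@Size is a geometric mean, an index configuration has to do well on both single-stream power and overall throughput to score highly.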
5.2 Baselines


There are several recent approaches for index selection using
reinforcement learning [9, 14, 24]. These methods generally use
Deep Q-Networks with Prioritized Experience Replay. However, only
SmartIX [14] evaluated its performance against other existing
state-of-the-art methods. The other methods [9, 24] do not compare
their performance with any other state-of-the-art method. Moreover,
some approaches [9, 24] measure their performance based only on the
reduction in query execution time. Such an evaluation cannot justify
the selected indexes as optimal, because in modern database systems
indexes are updated frequently (by updates and writes). A better
performance evaluation measures the latency and throughput of the
indexes using the standard database metrics Power@Size,
Throughput@Size, and QphH@Size. For these reasons, we select
SmartIX [14] as one of the baselines. Overall, we select two
enterprise-based solutions and two RL-based index selection methods,
which we re-implemented based on the information provided in their
research papers.

POWA [23]: The PostgreSQL Workload Analyzer is an optimization
tool for PostgreSQL. It collects various statistical data
from a database and suggests indexes to optimize the workload.
Figure 6: Workload execution time with respect to selected indexes on IMDB dataset using MANTIS and baselines.


EDB [4]: EnterpriseDB's Postgres Advanced Server is designed to
customize, tune, and handle massive Postgres database deployments.
As a baseline, we use its index advisor tool.

NODBA [26]: We re-implement NODBA based on details from the
paper. The authors use DQN for index recommendations. They use
the query and index configuration as input to the neural network.

SMARTIX [14]: We re-implement SMARTIX based on details 
provided in the paper. The authors use QphH@Size as a reward 
function and Q-Learning for selecting indexes. 

All Index: We create all possible indexes for the database and
use that as a baseline. For experiments with IMDB, we use
neither All Index, because of the large index space, nor NODBA,
because it does not support multi-table index selection.


6 RESULTS 


In this section, we compare and discuss the performance of MANTIS
with the baselines on the TPC-H and IMDB datasets. We evaluate our
framework in different scenarios, specifically single-attribute index
selection, multi-attribute index selection, and index selection on a
real-world dataset. We also measure and compare the workload execution
time.


6.1 TPC-H 


(1) First Scenario (1C - single attribute indexes): We calculate
Power@Size, Throughput@Size and QphH@Size for all of the
baseline systems. We observe comparable performance among most
of the baselines and MANTIS. Specifically, SMARTIX performs the
best in this setting, followed by POWA and EDB. Most of the
baselines are designed and tuned specifically for single-index
selection scenarios.

(2) 2C, 3C and 4C Scenarios (multi-attribute indexes): We use
both one- and two-column indexes for 2C. We observe that MANTIS
performs best, with a 17.7% QphH improvement over the second-best
method. We use 1-, 2-, and 3-column indexes for the 3C scenario, where
our framework shows a 12.1% QphH improvement over the second-best
method. We use 1-, 2-, 3-, and 4-column indexes for the 4C scenario,
where we observe a 4% QphH improvement of our framework over the best
baseline. All the results are shown in Figure 4.


We observe that our framework outperforms other baselines on 
the TPC-H dataset in most (3/4) of the scenarios. Next, we measure 
the performance on the IMDB dataset. 
6.2 IMDB


We observe that MANTIS outperforms the baseline systems in all
three stages, as shown in Figure 5. Specifically, there is a QphH
improvement of 3.19%, 2.9%, and 17.3% over the best baseline at
Stage 1, Stage 2, and Stage 3, respectively. Our framework can learn
complex index selection where other baselines struggle. Overall, we
observe that our framework outperforms the baselines in 3 of 4
scenarios on TPC-H and in all 3 stages on IMDB. The runtime of
MANTIS is about 2 hrs for TPC-H and about 6 hrs for IMDB.

To better understand the results from IMDB, we design an
experiment to answer the question: how effective are the selected
indexes? Ideally, we would like to observe a drop in the workload's
overall execution time when indexes are created. We execute the
benchmark and measure performance after every index creation. The
results are shown in Figure 6. The indexes selected by MANTIS give the
lowest workload execution time in all stages. We also observe that the
first index selected by MANTIS is optimal in all stages. There is also
a steady reduction in workload execution costs, which is ideal.


7 CONCLUSION AND FUTURE WORK 


This paper presents MANTIS, an end-to-end framework that recommends
indexes to enhance the efficiency of a query workload on a
database. Our implemented framework uses a Deep Neural Network
for index type selection and a Deep Q-Learning Network
algorithm for multi-attribute index recommendation. Compared to
previous methods, MANTIS can learn and propose single-attribute,
multi-attribute, and multi-type indexes. We evaluate MANTIS against
four other state-of-the-art methods using two standard benchmark
datasets. We use the standard DBMS performance metrics Power@Size,
Throughput@Size and QphH@Size for evaluation. The experiments
show that MANTIS can significantly outperform the state-of-the-art
index recommendation methods in 6 of 7 cases.

In the future, we aim to build a self-managing database utilizing 
Machine and Reinforcement Learning techniques. We intend to 
carry out our objective in stages. In the first stage, we plan to ex- 
tend our work MANTIS by adding more configuration parameters 
and evaluating our framework on larger and real-time environ- 
ments. This future work is the first stage of evaluating Machine 
and Reinforcement Learning techniques to autonomously configure 
a database at a large scale. In the second stage, we plan to apply
machine learning to optimally modify or extend a set of indexes in 
response to a changing workload, in essence performing incremen- 
tal index selection. In the third stage, we plan to extend incremental 
index selection to online selection of other configuration param- 
eters. Our objective in each stage is to move one step closer to a 
self-managed database. Another direction of our future work is to 
study temporal index selection for temporal databases, which is an 
open problem. 
ABSTRACT


With advancements in technology, huge volumes of valuable data 
have been generated and collected at a rapid velocity from a wide 
variety of rich data sources. Examples of these valuable data include 
healthcare and disease data such as privacy-preserving statistics 
on patients who suffered from diseases like the coronavirus dis- 
ease 2019 (COVID-19). Analyzing these data can be for social good. 
For instance, data analytics on the healthcare and disease data of- 
ten leads to the discovery of useful information and knowledge 
about the disease. Explainable artificial intelligence (XAI) further 
enhances the interpretability of the discovered knowledge. Conse- 
quently, the explainable data analytics helps people to get a better 
understanding of the disease, which may inspire them to take part 
in preventing, detecting, controlling and combating the disease. 
In this paper, we present an explainable data analytics system for 
disease and healthcare informatics. Our system consists of two 
key components. The predictor component analyzes and mines 
historical disease and healthcare data for making predictions on 
future data. Although huge volumes of disease and healthcare data 
have been generated, volumes of available data may vary partially 
due to privacy concerns. So, the predictor makes predictions with 
different methods. It uses random forest with sufficient data and
neural network-based few-shot learning (FSL) with limited data. 
The explainer component provides the general model reasoning 
and a meaningful explanation for specific predictions. As a database 
engineering application, we evaluate our system by applying it to 
real-life COVID-19 data. Evaluation results show the practicality of 
our system in explainable data analytics for disease and healthcare 
informatics. 
1 INTRODUCTION


With advancements in technology, huge volumes of valuable data 
have been generated and collected at a rapid velocity from a wide 
variety of rich data sources. Examples include: 

• biodiversity data [1],
• biomedical/healthcare data and disease reports (e.g., COVID-19
  statistics) [2-4],
• census data [5],
• imprecise and uncertain data [6-9],
• music data [10, 11],
• patent register [12, 13],
• social networks [14-19],
• time series [20-26],
• transportation and urban data [27-31],
• weather data [32], and
• web data [33-37].
Embedded in these data is implicit, previously unknown and
potentially useful information and knowledge that can be discovered
by data science [38-41], which makes good use of:

• data mining algorithms [42-50] (e.g., incorporating
  constraints [51-54]),
• data analytics methods [55-60],
• machine learning techniques [61-63],
• visualization tools [64, 65], and/or
• mathematical and statistical modeling [66].


Analyzing these data can be for social good. For instance,
analyzing and mining biomedical/healthcare data and disease reports
helps the discovery of useful information and knowledge about
diseases such as:


• severe acute respiratory syndrome (SARS), which was caused
  by a SARS-associated coronavirus (CoV) and led to an
  outbreak in 2003.
• swine flu, which was caused by influenza A virus subtype
  H1N1 (A/H1N1) and led to an outbreak from 2009 to mid-2010.
• Middle East respiratory syndrome (MERS), which was caused
  by MERS-CoV and led to outbreaks in the Middle East (e.g.,
  Saudi Arabia) between 2012 and 2018 and in South Korea in 2015.
• Zika virus disease, which was primarily transmitted by the
  bite of an infected mosquito and led to an outbreak in Brazil
  during 2015-2016.
• coronavirus disease 2019 (COVID-19), which was caused by
  SARS-CoV-2. It was reported to break out in 2019, became
  a global pandemic in March 2020, and is still prevailing in
  2021.


Discovered information and knowledge help prevent, detect,
control and/or combat the disease. This, in turn, helps save patients'
lives and improve our quality of life. Hence, it is useful to have a
data analytics system for analyzing and mining these data. In response,
we present a data analytics system for disease informatics and
healthcare informatics (aka health informatics).
Take COVID-19 as an example. Since its declaration
as a pandemic, there have been more than 180 million confirmed
COVID-19 cases and more than 3.9 million deaths worldwide (as
of July 01, 2021). These have led to huge volumes of valuable
data. However, partially due to privacy concerns (e.g.,
privacy-preserving data publishing) and/or fast reporting of the
information, the volume of available data may vary. As such, it is
important to design the data analytics system so that it can
discover useful knowledge from various volumes of available
data. In response, our data analytics system makes predictions on
future data based on the analysis and mining of various volumes of
historical disease and healthcare data. Specifically, it makes
predictions with different methods by using:
• random forest when the available data are sufficient, and
• neural network (NN)-based few-shot learning (FSL) when the
  available data are limited.


Advancements in technology have led to huge volumes of valuable
data, which may be of a wide variety of data types stored in a
wide variety of data formats, being generated and collected
at a rapid velocity from a wide variety of rich data sources. They
can also be at different levels of veracity. Hence, these data are
often characterized by multiple V's
(e.g., 5V's, 6V's, 7V's, etc.). These include:


(1) value, which focuses on data usefulness; 

(2) variety, which focuses on differences in data formats (e.g., 
comma-separated values (CSV), JavaScript object notation
(JSON)), types (e.g., structured relational/transactional data, 
semi-structured text on the web, unstructured audio or video), 
and/or sources; 

(3) velocity, which focuses on data generation and collection 
rates; 

(4) veracity, which focuses on data quality (e.g., precise vs. im- 
precise and uncertain data); 

(5) volume, which focuses on data quantity; as well as 

(6) validity and visibility, which focus on data interpretation 
and visualization. 


As "a picture is worth a thousand words", data interpretation and
visualization enhance data validity and visibility. Moreover, for
disease and healthcare informatics, the ability to interpret
and visualize the data and/or knowledge discovered by data analytics
is desirable, because this ability further enhances
user understanding of the disease. Hence, we incorporate explainable
artificial intelligence (XAI) into our data analytics system so that
the resulting explainable data analytics system not only makes
accurate predictions but also provides reasoning and meaningful
explanations for specific predictions.

In terms of related works, there have been works on disease and 
healthcare informatics. As an example, since the declaration of 
COVID-19 as a pandemic, researchers have focused on different 
aspects of the COVID-19 disease. Examples include: 


• Some social scientists have studied crisis management for
  the COVID-19 outbreak [67].
• Some medical and health scientists have focused on clinical
  and treatment information [68]. Some others have focused
  on drug discovery and vaccine development (e.g., messenger
  ribonucleic acid (mRNA) vaccines like Moderna and
  Pfizer-BioNTech, adenovirus vector vaccines like AstraZeneca and
  Janssen, inactivated virus vaccines, subunit vaccines) [69].
• Some natural scientists and engineers have examined
  artificial intelligence (AI)-driven informatics, sensing, imaging
  for tracking, testing, diagnosis, treatment and prognosis [70]
  such as those imaging-based diagnosis of COVID-19 using
  chest computed tomography (CT) images [71, 72]. They have
  also come up with mathematical modelling of the spread of
  COVID-19 [73].


Similar to the last category of examples, we also focus on AI-driven
informatics, in particular disease and healthcare informatics.
However, unlike the last category, we focus on textual data (e.g., blood
test results) instead of images like CT images. Note that medical and
healthcare data are expensive to produce. For instance, CT images
require radiologists to supervise the operation of CT scanners,
which are often available in large hospitals instead of small clinics.
In contrast, blood tests are more regular procedures, which are
more accessible and easily carried out by most medical staff even
in small clinics.
The key contribution of this paper is our explainable data
analytics system for disease and healthcare informatics. We design the
system so that it consists of two key components:


(1) The predictor component analyzes and mines historical
    disease and healthcare data for making predictions on future
    data. It makes predictions with different methods based on
    the volume of available data. To elaborate, it uses:
    • random forest with sufficient data, and
    • neural network-based few-shot learning (FSL) with limited data.
(2) The explainer component provides:
    • the general model reasoning, and
    • a meaningful explanation for specific predictions.


As a database engineering application, we evaluate our system 
by applying it to real-life blood test results for COVID-19 data. 
Evaluation results show the practicality of our system in explainable 
data analytics for disease and healthcare informatics. 

The remainder of this paper is organized as follows. The next 
section discusses related works. Section 3 describes our explainable 
data analytics for disease and healthcare informatics. Section 4 
shows evaluation results and Section 5 draws the conclusions. 


2 BACKGROUND AND RELATED WORKS 


Recall from Section 1 that, since its declaration as a pandemic, there
have been more than 180 million confirmed COVID-19 cases and more
than 3.9 million deaths worldwide (as of July 01, 2021). In Canada,
there have been more than 1.4 million confirmed COVID-19 cases
and more than 26 thousand deaths¹. Like many other viruses,
SARS-CoV-2 (which causes COVID-19) also mutates over time. This
has led to several SARS-CoV-2 variants², which can be categorized
as:


• variants of concern (VOC):
  - alpha (lineage B.1.1.7), whose samples were first
    documented in the UK in September 2020;
  - beta (B.1.351), whose samples were first documented in
    South Africa in May 2020;
  - gamma (P.1), whose samples were first documented in
    Brazil in November 2020; and
  - delta (lineages B.1.617.2, AY.1, and AY.2), whose samples
    were first documented in India in October 2020.
• variants of interest (VOI):
  - eta (lineage B.1.525), whose samples were first documented
    in December 2020;
  - iota (B.1.526), whose samples were first documented in the
    USA in November 2020;
  - kappa (B.1.617.1), whose samples were first documented
    in India in October 2020; and
  - lambda (C.37), whose samples were first documented in
    Peru in December 2020.


¹ https://www.ctvnews.ca/health/coronavirus/tracking-every-case-of-covid-19-in-canada-1.4852102

Since January 2021, more than 187 thousand Canadian COVID-19
cases have been screened and sequenced. As of July 01, 2021, there have
been more than 222 thousand alpha, 2 thousand beta, 18 thousand
gamma, and 4 thousand delta cases³.
Hence, early detection of COVID-19 infection is crucial. To
do so, two main types of tests are used:


• antibody tests (aka serology tests), which test for past
  infection. Specifically, these tests look for the presence of proteins
  created by the immune system soon after a person has
  been infected or vaccinated.

• viral tests (aka diagnostic tests), which test for current
  infection. Specifically, two common viral tests are:

  - molecular tests, such as polymerase chain reaction (PCR)
    tests (aka nucleic acid amplification tests (NAATs)), which
    look for the presence of viral genetic material, and

  - antigen tests, which look for the presence of specific
    proteins from the virus.

  Between the two main types of viral tests, molecular tests
  are more costly, more invasive (e.g., they require a nasopharyngeal
  (NP) swab), and may take several hours to days to return
  results in traditional laboratory settings when compared
  with antigen tests. However, in addition to being
  specific (in the sense of being able to correctly identify those
  without the disease, i.e., high true negative rate), molecular
  tests are more sensitive (in the sense of being more capable
  of correctly identifying those with the disease, i.e., higher true
  positive rate) than antigen tests. Hence, considering the
  tradeoff between the two main types of viral tests, it is
  desirable to have an efficient but also accurate way to determine
  or predict the results (i.e., positive or negative for diseases
  like COVID-19).


In data analytics, it is common to train a prediction model with
lots of data because data are one of the most important (if not the
most important) resources for researchers in many fields. Analysts
look for trends, patterns and commonalities within and among samples.
However, in the medical and health science field, samples can be
expensive to obtain. Moreover, due to privacy concerns and other
related issues, few samples may be made available for analyses. This
motivates our current work on designing a data analytics system that
uses different methods to make predictions based on the availability
of data (e.g., sufficient or limited data).

In terms of related works, Rustam et al. [74] forecasted new
COVID-19 cases in three different dimensions (i.e., number of new
cases, death rate, and recovery rate) based on historical time
series on these dimensions collected by Johns Hopkins University.
Although their model was useful in forecasting new cases, it relied
on how extensively the population had been tested. Data from
regions without extensive tests (e.g., due to lack of resources) could
lead to unreliable predictions. In contrast, our current work focuses
on forecasting the positive and negative COVID-19 cases.

Brinati et al. [75] presented a machine learning solution for
predicting COVID-19 with routine blood test data collected from
a hospital in Milan, Italy. They analyzed a dataset of 279 patients,
of which 177 tested positive and 102 tested negative.
They observed that the random forest (RF) model led to an accurate
prediction. As for the model interpretation (e.g., level of importance
of the features), they relied solely on validation from healthcare
professionals. Like theirs, our current work also uses RF when the
available data are sufficient. However, unlike theirs, our current
work provides healthcare professionals with the general reasoning
of the prediction model and meaningful explanations of specific
predictions. This helps reduce the workload of busy healthcare
professionals.

³ https://www.ctvnews.ca/health/coronavirus/tracking-variants-of-the-novel-coronavirus-in-canada-1.5296141

Xu et al. [76] combined lab testing data and CT scan images to
develop a robust model, which helped differentiate COVID-19 cases
from other similar diseases and determine the degree of severity.
Their model classified the patients according to the following labels:
non-severe or severe COVID-19, as well as healthy or viral pneumonia.
They used a combination of traditional machine learning models (e.g.,
RF) and deep learning (e.g., convolutional neural networks (CNN)).
However, it is important to note that CT scan images are expensive
to produce. Moreover, sophisticated deep learning techniques may
also incur high computational costs. In contrast, our current work
does not rely on expensive CT scan images or computationally
intensive deep learning techniques, while maintaining reasonably
good prediction results.

While RF can handle situations in which data are sufficient,
we explore few-shot learning (FSL) [77] to handle situations in
which data are limited. In general, FSL is a machine learning
technique that aims to learn from a limited number of examples
with supervised information for some classes of tasks.
It has become popular. For instance, Snell et al. [78] applied
FSL to the miniImageNet dataset with 5-shot modelling and to the
Omniglot dataset.


3 OUR EXPLAINABLE DATA ANALYTICS SYSTEM


3.1 Overview 


Our explainable data analytics system for disease and healthcare 
informatics consists of two key components: 


(1) predictor component, which analyzes and mines historical
    disease and healthcare data for making predictions on future
    data. To handle different levels of data availability, it uses:
    • the random forest (RF) when data are sufficient, and
    • the neural network (NN)-based few-shot learning (FSL)
      when data are limited.

(2) explainer component, which provides:
    • the general model reasoning, and
    • meaningful explanations of specific predictions.


3.2 Our Predictor for Sufficient Data 


To aim for accurate prediction, our predictor first cleans the data
and engineers features. Although imputation techniques (which
usually serve to preserve observations with missing information)
work well in many applications, we choose not to use them for
medical data because any change in the range of clinical values
could lead to misleading results. Instead, given a dataset with many
variables (aka features or columns) and observations, we clean
the data by first profiling them. We preserve all columns whose
share of missing values is below a certain threshold (say, 90%). To
reduce dimensionality, we remove columns that are highly correlated
with others because, owing to the high correlation, the outcome stays
the same after their removal. Then, we remove columns with constant
values because they do not distinguish between observations.

Afterwards, our predictor applies some engineering to the
remaining features. As an example for COVID-19 data, when the
number of records with missing values is high, we create an extra
column to indicate whether the patient tested positive for any other
viruses (say, 19 other viruses). This single column helps reduce
the sparsity of many columns (e.g., these 19 columns). We then
transform categorical features into numerical ones and use
Pearson's correlation test, as shown in Eq. (1), to check the feature
interactions with the target outcome.

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (1)$$


where


• $\operatorname{cov}(X, Y)$ is the covariance of X and Y; and
• $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively.
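
A direct, minimal translation of Eq. (1) into Python is shown below; the function name and the example data are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical feature values and a binary target outcome.
feature = [1.0, 2.0, 3.0, 4.0]
target = [0, 0, 1, 1]
rho = pearson(feature, target)
```

A value of `rho` close to 1 or -1 indicates a strong linear relationship between the feature and the target, while a value near 0 indicates a weak one.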


Once the data are cleaned and engineered, our predictor splits
the dataset into training and testing sets. It applies stratification
when doing so; in other words, the distribution of each class
(positive and negative) is kept proportional between the two sets. A
difference in the ratio between positive and negative classes in the
training set may cause the "class imbalance" problem, in which
one class dominates the learning process of the model. To avoid
this problem, we apply k-means clustering on the training set to
reduce the number of negative-class instances. By doing so, we
eliminate some of the instances based on clustering similarity,
removing data points that are similar to others that remain. We also
explore a good ratio between the two classes.
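
The stratified split can be sketched as follows. The sketch covers only the stratification step, not the k-means undersampling; the function name and the 20% test fraction are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) that preserve the class ratio."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical imbalanced labels: 20 positive (1) and 80 negative (0) cases.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)
```

With these labels, both sets keep the original 1:4 ratio of positive to negative cases, so any rebalancing of the training set starts from a split that is representative of the full dataset.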

In addition, our predictor applies a random search on the training
dataset to find the best set of hyperparameters for RF prediction.
As an ensemble algorithm for machine learning classification and
regression prediction, RF evolves from the decision tree algorithm.
It enhances the predictive power by using a forest of multiple
trees. Each tree receives a different portion of the dataset, and the
final prediction result is an average of the results of the individual
decision trees in the forest. To evaluate the performance
and determine the best set of hyperparameters, we apply 10-fold
cross-validation to evaluate the results on different portions of the
dataset. The cross-validation technique serves to approximate the
model results on an unseen dataset, and thus helps avoid over-fitting.
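
The combination of random search and 10-fold cross-validation can be sketched generically as below. To stay self-contained, the sketch does not train a random forest: `evaluate` is a hypothetical callback that would train a model on the training fold and return a validation score, and the hyperparameter names in the usage example are illustrative.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def random_search(param_space, evaluate, n_samples, n_iter=20, k=10, seed=0):
    """Sample `n_iter` settings and keep the one with the best mean CV score.

    `evaluate(params, train_idx, val_idx)` is a hypothetical callback that
    fits a model on the training fold and scores it on the validation fold.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one candidate value per hyperparameter.
        params = {name: rng.choice(values)
                  for name, values in param_space.items()}
        # Every candidate is scored on the same k folds.
        scores = [evaluate(params, train, val)
                  for train, val in k_fold_indices(n_samples, k, seed)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

A call might look like `random_search({"n_estimators": [10, 50, 100], "max_depth": [2, 4]}, evaluate, n_samples=len(train_set))`, where `train_set` and `evaluate` are supplied by the caller. Averaging over the folds makes the comparison between hyperparameter settings less sensitive to any single split.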


3.3 Our Predictor for Limited Data


To handle situations in which the available data are limited, our
predictor uses FSL. With an autoencoder architecture, it learns
internal representations by error propagation. Specifically, once the
data are cleaned and engineered (as described in Section 3.2), it first
builds an autoencoder with four fully connected layers to reconstruct
the input features. Here, the input layer takes n features. With a
hidden layer, the encoder module of the autoencoder maps the
input features $\{F_i\}_{i=1}^{n}$ to produce encoded features. With
another hidden layer, the decoder module of the autoencoder maps the
encoded features to produce reconstructed input $\{R_i\}_{i=1}^{n}$.
During the process, the autoencoder aims to minimize the loss function
$L_R$ in the reconstruction of input features. This loss function
$L_R$ is computed as a mean absolute error:


$$L_R = \frac{1}{n} \sum_{i=1}^{n} |F_i - R_i| \qquad (2)$$


where


• n is the number of training data samples,

• $F_i$ is one of the n features $\{F_i\}_{i=1}^{n}$ fed into (the
  encoder module of) the autoencoder, and

• $R_i$ is one of the n reconstructed features $\{R_i\}_{i=1}^{n}$
  reconstructed by (the decoder module of) the autoencoder.
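
As a small worked illustration of the mean absolute error in Eq. (2), the following sketch computes the reconstruction loss for one input vector. The function name is hypothetical, and the values are made up for the example.

```python
def reconstruction_loss(features, reconstructed):
    """Mean absolute error between input features F_i and reconstructions R_i."""
    if len(features) != len(reconstructed):
        raise ValueError("features and reconstructed must have the same length")
    n = len(features)
    return sum(abs(f - r) for f, r in zip(features, reconstructed)) / n


# |1.0 - 1.0| + |2.0 - 2.5| + |3.0 - 2.0| = 1.5, averaged over n = 3 terms.
loss = reconstruction_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```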


Afterward, the predictor freezes the autoencoder and builds a
neural network with two fully connected layers to predict the
class label (i.e., positive or negative). Here, the input layer takes
encoded features from the frozen autoencoder (specifically, the
encoder module). With a hidden layer, the prediction module maps
the encoded features to produce a single-label prediction:

$$\hat{y}_i = \begin{cases} 0 & \text{for negative class label} \\ 1 & \text{for positive class label} \end{cases} \qquad (3)$$


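During the process, the prediction module aims to minimize the loss
function $L_P$ in the prediction of class labels. This loss function
$L_P$ is computed as a binary cross-entropy loss:


$$L_P = -\sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \qquad (4)$$


where $y_i$ is the true class label of the i-th sample, $\hat{y}_i$ is
its predicted label, and m is the total number of limited data samples
per class label.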
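
The binary cross-entropy loss can be illustrated with the short sketch below. It assumes the summed form of the loss (dividing the result by m would give the mean form used by many libraries) and clips probabilities to avoid log(0); the function name and the clipping constant are hypothetical.

```python
import math


def prediction_loss(labels, probabilities, eps=1e-12):
    """Summed binary cross-entropy between true labels y_i and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total


# An uninformative predictor (probability 0.5 for every sample) costs log(2) per sample.
uninformative = prediction_loss([1, 0], [0.5, 0.5])
# Confident and correct predictions cost almost nothing.
confident = prediction_loss([1, 0], [1.0, 0.0])
```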


3.4 Our Explainer 


To provide explanations to users, once the data are cleaned and
engineered, our explainer first computes Pearson's correlation as per
Eq. (1) to find correlations among the different cleaned and engineered
features. It represents these correlations in a heat map.
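
For illustration, the matrix underlying such a heat map can be computed as sketched below. The sketch uses the standard definition of Pearson's correlation coefficient; the column names and values are hypothetical, and constant-value columns (which would make the denominator zero) are assumed to have been removed beforehand.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy columns; each cell of the result is one cell of the heat map.
matrix = correlation_matrix({
    "leukocytes": [1.0, 2.0, 3.0, 4.0],
    "platelets": [2.0, 4.0, 6.0, 8.0],
    "result": [4.0, 3.0, 2.0, 1.0],
})
```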

In addition, our explainer also produces a partial dependence plot
(PDP) explaining the interactions between the independent features
and the target one. The PDP also indicates the relative importance
of these features.

Following the general explanation given above, our explainer
also explores explanations of specific instances to provide a deeper
level of understanding. To elaborate, for each specific prediction,
our explainer examines the degree of positive or negative contribution
of each feature by showing a bar chart. Features are listed on the
y-axis, and degrees of (positive or negative) contribution are
indicated on the x-axis. The degree of positive contribution ranges
from 0 to +1, whereas the degree of negative contribution ranges
from 0 to -1. Values of each attribute are normalized so that the
average is 0, the maximum value is +1, and the minimum value is -1.
The maximum and minimum degrees indicate a high positive and a high
negative contribution, respectively.
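
One way to obtain such a scale is sketched below. The text does not spell out the exact normalization, so this is an assumption: values are centred on the average, and the positive and negative sides are scaled separately so that the maximum maps to +1 and the minimum to -1. The function name is hypothetical.

```python
def normalize_contributions(values):
    """Map values so that the average becomes 0, the maximum +1 and the minimum -1."""
    mean = sum(values) / len(values)
    upper = max(values) - mean  # distance from the average to the maximum
    lower = mean - min(values)  # distance from the average to the minimum
    scaled = []
    for v in values:
        if v > mean and upper > 0:
            scaled.append((v - mean) / upper)
        elif v < mean and lower > 0:
            scaled.append((v - mean) / lower)
        else:
            scaled.append(0.0)
    return scaled


# The average of [1, 2, 6] is 3, so 6 maps to +1, 1 maps to -1, and 2 to -0.5.
scaled = normalize_contributions([1.0, 2.0, 6.0])
```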
4 EVALUATION


To evaluate our system, we conducted experiments on a real-life
open COVID-19 dataset*. Specifically, the dataset contains samples
collected to perform the SARS-CoV-2 reverse transcription polymerase
chain reaction (RT-PCR) and additional laboratory tests during a
visit to a hospital in the state of São Paulo, Brazil. After data
cleaning and preprocessing, the dataset contains 5,644 potential
patients (COVID-19 or not), aka samples or instances.
Each instance captures:


• an ID;

• age group;

• hospitalization status, such as (a) admitted to regular ward,
  (b) admitted to semi-intensive unit (SIU), or (c) admitted to
  intensive care unit (ICU);

• 106 features (69 numerical features and 37 categorical
  features); and

• a class label (i.e., tested positive or negative).


In terms of data distribution, 558 instances (i.e., 9.9% of 5,644
instances in the dataset) were tested/labelled “positive” for the
SARS-CoV-2 test result, and the remaining 5,086 instances (i.e., 90.1%)
were tested/labelled “negative”.

Initially, we started with a dataset with 112 variables (aka features
or columns) and 5,644 observations. Observing that more than 88%
of the values were missing, we preserved only columns with less than
90% missing values. Then, we removed highly correlated columns
and constant-value columns. We also observed that 19 variables
indicated the presence of other viruses (e.g., adenovirus,
influenza A, rhinovirus). As such, to further reduce dimensionality,
we replaced these 19 columns with a single column indicating whether
the instance tested positive for at least one of the 19 viruses. At
the end of the data cleaning and feature engineering process, we
were left with 16 important features.
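
Two of the cleaning steps above (dropping sparse columns and merging the virus indicators) can be sketched as follows. The column names, the value "detected", and the name of the merged column are hypothetical; only the 90% threshold and the at-least-one-virus rule come from the text.

```python
def drop_sparse_columns(rows, max_missing=0.90):
    """Keep columns whose fraction of missing (None) values is below max_missing."""
    columns = rows[0].keys()
    kept = [c for c in columns
            if sum(r[c] is None for r in rows) / len(rows) < max_missing]
    return [{c: r[c] for c in kept} for r in rows]


def merge_virus_columns(rows, virus_columns, merged_name="other_virus_detected"):
    """Replace several per-virus columns with a single at-least-one-positive flag."""
    merged = []
    for r in rows:
        new_row = {c: v for c, v in r.items() if c not in virus_columns}
        new_row[merged_name] = any(r[c] == "detected" for c in virus_columns)
        merged.append(new_row)
    return merged


# Hypothetical two-row dataset: "urea" is always missing and is therefore dropped.
rows = [
    {"age": 5, "urea": None, "influenza_a": "detected", "rhinovirus": "not_detected"},
    {"age": 9, "urea": None, "influenza_a": "not_detected", "rhinovirus": "not_detected"},
]
dense = drop_sparse_columns(rows)
cleaned = merge_virus_columns(dense, ["influenza_a", "rhinovirus"])
```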

Figure 1 shows a heat map with all the computed correlations.
All rows except the last one (i.e., first 15 rows) are independent
variables, and the last row corresponds to the target feature
“SARS-CoV2 exam result”. As observed from this figure, both leukocytes
and platelets are features that present a higher correlation with
the target variable. From immunological and clinical viewpoints,
leukocyte is a general term for all white blood cells, and platelets
are blood components in charge of clotting and wound healing. Both
are highly correlated with the target. Based on that, we eliminated
any rows that had missing values for these two variables. We ended
up with a dataset with (16 features and) 598 rows (i.e., observations).

We split the dataset with stratification into 70% for training
and 30% for testing such that each class (positive and negative)
was proportional between the two sets. Consequently, we used
418 observations (with 361 negatives and 57 positives) for the
training set, and 180 instances (with 156 negatives and 24 positives)
for testing. Observing that the ratio between negative and positive
classes in the training set was around 6.3 (i.e., “class imbalance”),
we applied k-means clustering on the training set to reduce the
number of negative instances by eliminating some instances based on
clustering similarities and removing similar data points in the
remainder. Consequently, we reduced the ratio from 6.3 to 1.52 (with
87 negatives and 57 positives).


*https://www.kaggle.com/einsteindata4u/covid19

Figure 1: Our explainer shows a heat map to explain correlations among dependent and target features
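
The stratified split and the class ratio can be sketched as below. This is an illustrative stand-in rather than the authors' tooling: the rounding of per-class test sizes is an assumption, so the exact counts may differ by one from those reported above. The ratio is computed as negatives over positives (361/57 before reduction, 87/57 after).

```python
import random


def stratified_split(labels, test_fraction=0.30, seed=0):
    """Split sample indices so that each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


def imbalance_ratio(labels):
    """Number of negatives (label 0) per positive (label 1)."""
    positives = sum(1 for y in labels if y == 1)
    return (len(labels) - positives) / positives


# 598 rows in total: 517 negatives and 81 positives, as in the text.
labels = [0] * 517 + [1] * 81
train, test = stratified_split(labels)
ratio_before = imbalance_ratio([0] * 361 + [1] * 57)
ratio_after = imbalance_ratio([0] * 87 + [1] * 57)
```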

We ran our RF-based predictor with these 418 training observations
and 180 testing instances. In addition, we also compared our
predictor with several existing ML models:

• logistic regression [79],
• decision tree [79],
• gradient boosting [79], and
• linear support vector classification (SVC) [79].


We measured F1 score, which is computed by:


$$\text{F1 score} = \frac{2\,TP}{2\,TP + FP + FN} \qquad (5)$$


where


• TP is the number of true positives,
• FP is the number of false positives, and
• FN is the number of false negatives.
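
As a worked example of Eq. (5), the sketch below computes the F1 score from raw counts. The counts are made up for illustration and are not taken from the experiments.

```python
def f1_score(tp, fp, fn):
    """F1 score from counts of true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# With 8 true positives, 2 false positives and 2 false negatives:
# F1 = (2 * 8) / (2 * 8 + 2 + 2) = 16 / 20.
example = f1_score(tp=8, fp=2, fn=2)
```

Because FN appears in the denominator, every missed positive case lowers the score, which matters in a screening setting.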


Each experiment was run with 5-fold cross-validation, and the average
F1 score was computed. Table 1 shows that our RF-based predictor
makes accurate predictions with sufficient data. It led to the highest
F1 score (tied with logistic regression), the highest number of true
positives (TP), and the lowest number of false negatives (FN). Note
that FN may lead to late detection of untreated patients, causing
wider spread of the disease, which in turn may lead to more cases.

Table 1: Comparison of our random forest (RF)-based predictor with
related works

                            F1     TP     FN
Logistic regression [79]    0.86   45.6   3.0
Decision tree [79]          0.81   42.2   6.4
Gradient boosting [79]      0.85   44.0   4.6
Linear SVC [79]             0.85   44.8   3.8
Our RF-based predictor      0.86   46.2   2.4

Moreover, we also ran our NN-based FSL predictor with a few
samples (e.g., 1 and 5 samples for each class label) from the 418
training observations and 180 testing instances. Again, we also
compared our predictor with the aforementioned existing ML models.
Tables 2 and 3 both show that our NN-based FSL predictor makes accurate
predictions with limited data. It led to the highest F1 score, the
highest number of true positives (TP), and the lowest number of
false negatives (FN). Note that FN may lead to late detection of
untreated patients, causing wider spread of the disease, which in
turn may lead to more cases.

Figure 2: Our explainer shows a partial dependence plot to explain feature importance


Table 2: Comparison of our neural network (NN)-based few-shot
learning (FSL) predictor with related works when using 1 sample per
class label

1 sample per class            F1     TP      FN
Logistic regression [79]      0.52   108.6   133.4
Decision tree [79]            0.52   127.0   115.0
Gradient boosting [79]        0.55   112.4   129.6
Linear SVC [79]               0.55   123.2   118.8
Our NN-based FSL predictor    0.75   241.0   1.0


Table 3: Comparison of our NN-based FSL predictor with related works
when using 5 samples per class label

5 samples per class           F1     TP      FN
Logistic regression [79]      0.68   156.4   81.6
Decision tree [79]            0.66   152.2   85.8
Gradient boosting [79]        0.66   153.2   84.8
Linear SVC [79]               0.63   143.4   94.6
Our NN-based FSL predictor    0.70   225.2   12.8

The above evaluation results show the effectiveness and practicality
of our predictors (with sufficient and limited data). In addition,
we also evaluated our explainer. For instance, it captures some
insights into the model's reasoning for the predictions by producing
a partial dependence plot (PDP), as shown in Figure 2, to explain
the interactions between the independent features and the target one.
An interpretation of this figure is as follows:


• The most important feature for the prediction is “Leukocytes”.
  In the immunological and clinical context, leukocyte is a
  general term for all white blood cells.

• The least important is “Mean corpuscular hemoglobin”. In
  the immunological and clinical context, mean corpuscular
  hemoglobin concentration (MCHC) measures the concentration
  of hemoglobin in a given volume of packed red blood cells.
  It can be computed by dividing the hemoglobin by the
  hematocrit. A low level of MCHC is an indication of anemia.

• Low levels of leukocytes, eosinophils and platelets all increase
  the probability of a positive prediction (COVID-19 positive).
  Again, in the immunological and clinical context, an eosinophil
  is a type of disease-fighting white blood cell. A high level of
  eosinophils often indicates a parasitic infection, an allergic
  reaction or cancer.

• When other kinds of viruses are detected in the patient, the
  probability of a negative prediction (COVID-19 negative)
  increases.


Following the general explanation given above, our explainer
also explores explanations of specific instances to provide a deeper
level of understanding. To elaborate, for each specific prediction,
our explainer examines the degree of positive or negative contribution
of each feature by showing a bar chart. See Figures 3 and 4 for two
examples.

Figure 3: Our explainer provides an explanation for a COVID-19 true negative instance

To elaborate, in Figure 3, our predictor predicts that the patient
had a 34% chance of having COVID-19, i.e., the patient is likely
to be negative in our binary classification model. Our explainer
provides the specific reasoning for this instance: “platelets,” “other
virus detected” and “leukocytes” contribute negatively because their
(negative) values indicate that these features, for this specific
patient, decreased the probability of a positive diagnosis, i.e.,
increased the probability of a negative diagnosis. We observe that
these negatively contributing features follow the general explanations
provided by our explainer in Figure 2. Hence, our explainer
supports clinicians in trusting the prediction made by our predictor.

Moreover, these observations and explanations (e.g., high level
of platelets, detection of other viruses, high level of leukocytes,
i.e., a high number of white blood cells) are consistent with the
medical literature. To elaborate, platelets are involved in blood
clotting. It was evidenced that “blood tests in symptomatic COVID-19
show a deficiency of lymphocytes and white blood cells, in general.
Lower platelet counts are a marker of higher mortality in hospitalized
patients.” [80] “Early reports from China suggested that co-infection
with other respiratory pathogens was rare. If this were the case,
patients positive for other pathogens might be assumed unlikely to
have SARS-CoV-2. The Centers for Disease Control and Prevention
endorsed testing for other respiratory pathogens, suggesting that
evidence of another infection could aid the evaluation of patients
with potential COVID-19 in the absence of widely available rapid
testing for SARS-CoV-2.” [81, 82] Hence, high levels of platelets and
leukocytes, as well as detection of other viruses, are key contributing
factors to true negatives.

Figure 4: Our explainer provides an explanation for a COVID-19 true positive instance

Figure 4 shows an example of our explainer providing an explanation
for a predicted COVID-19 positive diagnosis with 78% probability.
Here, many factors increased the probability of a confirmed diagnosis.
Again, we observe that these positively contributing features follow
the general explanations previously described in Figure 2. Moreover,
these observations and explanations (e.g., low level of leukocytes,
aka leukopenia, which refers to a low level of white blood cells,
and low level of platelets) are consistent with the medical
literature. It was evidenced that “blood tests in symptomatic
COVID-19 show a deficiency of lymphocytes and white blood cells, in
general. Lower platelet counts are a marker of higher mortality in
hospitalized patients.” [80] Hence, low levels of leukocytes and
platelets, together with some other features, are key contributing
factors to true positives.


5 CONCLUSIONS 


In this paper, we present an explainable data analytics system for
disease and healthcare informatics. Our system consists of two
key components. The predictor component analyzes and mines
historical disease and healthcare data for making predictions on
future data. As volumes of available data may vary, the predictor
makes predictions with a random forest model when the available
data are sufficient. It makes predictions with a neural network-based
few-shot learning model when the available data are limited. The
explainer component provides the general model reasoning for the
prediction by showing heat maps to explain correlations among
dependent features and target class labels, as well as partial
dependence plots to explain the importance of these features. In
addition to general model reasoning, our explainer also provides
explanations for specific instances by showing how the normalized
values of features increase or decrease the probability of having a
positive label. In other words, it explains the feature values that
contribute towards a positive or negative prediction made by our
predictor. As a database engineering application, we demonstrate and
evaluate our system by applying it to real-life COVID-19 data.
Evaluation results show the practicality of our system in explainable
data analytics for disease and healthcare informatics on COVID-19.
It is important to note that our system is designed in such a way
that it is capable of handling other diseases. As ongoing and future
work, we transfer knowledge learned here to disease and healthcare
informatics of other diseases. We also explore incorporating user
preferences in providing users with explainable data analytics for
disease and healthcare informatics.
ABSTRACT


Public and private organizations produce and store huge 
amounts of documents which contain information about their 
domains in non-structured formats. Although from the final 
user’s point of view we can rely on different retrieval tools 
to access such data, the progressive structuring of such doc- 
uments has important benefits for daily operations. While 
there exist many approaches to extract information in open 
domains, we lack tools flexible enough to adapt themselves 
to the particularities of different domains. 

In this paper, we present the design and implementation of 
ICIX, an architecture to extract structured information from 
text documents. ICIX aims at obtaining specific information 
within a given domain, defined by means of an ontology which 
guides the extraction process. Besides, to optimize such an 
extraction, ICIX relies on document classification and data 
curation adapted to the particular domain. Our proposal has 
been implemented and evaluated in the specific context of 
managing legal documents, with promising results. 


CCS CONCEPTS 


• Information systems → Document structure; Ontologies;
Content analysis and feature selection.


KEYWORDS 


Information extraction, ontologies, text classification 


ACM Reference Format:

Angel L. Garrido, Alvaro Peiro, Cristian Roman, Carlos Bobed,
and Eduardo Mena. 2021. ICIX: A Semantic Information Extraction
Architecture. In 25th International Database Engineering
& Applications Symposium (IDEAS 2021), July 14-16, 2021,
Montreal, QC, Canada. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3472163.3472174


Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.

IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada

© 2021 Copyright held by the owner/author(s). Publication rights
licensed to ACM.

ACM ISBN 978-1-4503-8991-4/21/07...$15.00

https://doi.org/10.1145/3472163.3472174


IDEAS 2021: the 25th anniversary


Alvaro Peiro
apeiro@isyc.com
InSynergy Consulting S.A.
Madrid, Spain


Cristian Roman
croman@isyc.com
InSynergy Consulting S.A.
Madrid, Spain


Eduardo Mena
emena@unizar.es
University of Zaragoza
Zaragoza, Spain


1 INTRODUCTION 


Information Extraction (IE) is the task of automatically
extracting structured information from unstructured or
semi-structured documents which, in most cases, involves processing
human language texts. IE is pervasive in all kinds of fields
(e.g., science, law, news), and it is still being done manually
in many organizations dedicated to document management. These
manual IE tasks are very time-consuming, costly, and subject to
many human errors. In recent years, the use of automatic IE tools
has gradually gained popularity thanks to the good results obtained
in tasks related to obtaining structured data from text-based
documents [14]. These techniques provide users with benefits which
contribute to their adoption in all types of organizations, private
and public, as well as to the emergence of dedicated software and
companies that offer this type of service. Even so, when dealing
with IE, we face one of the great problems related to Artificial
Intelligence and Natural Language Processing (NLP): after almost
forty years, the algorithms created to solve this type of issue are
still far from being generic enough to operate with any natural
language and type of document. In fact, when focusing on IE that
achieves optimal results, one of the greatest difficulties in
developing a generic IE system is the strong dependence on 1) the
domain of the handled documents and 2) the specific target natural
language [1].
Apart from the domain generalization difficulties, we also have
to bear in mind both document pre- and post-processing.
Regarding pre-processing, past and current NLP applications
usually work with well-formed texts, but when dealing with
real scenarios (e.g., health, legal, or industrial documents),
it is usual that the owner of the documents must keep the
original ones, and therefore, NLP systems have to work just
with digital copies (i.e., scanned versions of the documents).
So, in these situations, it is mandatory to apply an Optical
Character Recognition system (OCR from now on) to identify
the text of the document before analyzing it. This introduces
noise derived from errors in the recognition process, which
compromises the accuracy of the information extraction task.
Its automatic correction also implies the adaptation of the
algorithms to the working context. Regarding post-processing,
automatic information extraction systems usually do not have
means to check the quality of data, searching for possible
errors (a task that is heavily domain-dependent). Among
the most common IE errors, we can mention: empty or duplicate
data, poorly structured data, and data distributed among
several entities which should refer to just one entity and
vice versa [24]. As a result, it is very complex to create a
system capable of correctly solving these IE-specific tasks in
a general way for any domain or use case [28]. Under these
circumstances, research efforts that minimize the customization
requirements are always welcome, as they contribute to
facilitating the design of domain-adaptable systems.

In this paper, we present our approach to deal with these
scenarios where we need to extract specific data from extensive
documents belonging to a particular domain. In particular, we
focus on the following objectives: 1) to create a generic
architecture applicable to multiple scenarios while minimizing
custom programming; 2) to strengthen the extraction process by
avoiding non-relevant information and correcting OCR errors;
3) to ensure the quality of the extracted information
by exploiting the available knowledge to perform a review
of the extracted data and curate possible errors; and 4) to
allow the integration of additional relevant services (e.g., user
validation, image capture tools, help, etc.). To achieve these
goals, we propose the ICIX architecture, where each
objective is achieved as follows:


(1) the application domain is captured by an ontology (as
    defined by Gruber [11], a formal and explicit specification
    of a shared conceptualization), which contains not
    only knowledge about the domain, but also knowledge
    about the structure of the different types of documents,
    as well as references to pertinent extraction mechanisms.
    This ontology guides the different stages of the
    extraction process.

(2) the recall and precision of the extraction are improved
    by adding a pre-processing step where text curation [7]
    and automatic text classification are applied.

(3) the knowledge captured in the ontology is leveraged
    to detect and solve consistency errors in the extracted
    data.

(4) the architecture is designed so that new modules can
    be assembled, allowing additional annex functions.


We have applied our proposal to the legal domain by implementing
the AIS¹ System, which has been integrated within
the commercial Content Relationship Management (CRM)
solution of InSynergy Consulting², a well-known IT company
belonging to the international TESSI group³. In this context,
we have performed a set of preliminary tests with data
extracted by AIS from a real legal document dataset, composed
of notarial deeds of constitution of mortgages (in Spanish),
showing the feasibility and the benefits of our approach.

The rest of the paper is structured as follows. Section 2
briefly describes the ICIX architecture. Section 3 provides
design details of the implementation of the document processing
task in a specific context. Section 4 presents the results
of the empirical study conducted to assess our proposal in
that context. Section 5 gives an overview of the state of the
art concerning our approach. Finally, Section 6 offers
some concluding remarks and directions for future work.


¹AIS stands, in Spanish, for Análisis e Interpretación Semántica
(Analysis and Semantic Interpretation).

²http://www.isyc.com

³https://www.tessi.fr


2 ARCHITECTURE OVERVIEW 


Figure 1 provides an overview of the ICIX architecture. First,
we focus on its inputs and outputs; then, we describe the
proposed information repositories and briefly explain the
document processing steps. Finally, we comment on some
possible relevant services that can be added to the
architecture.


2.1 Inputs and Outputs 


The main inputs of the system are the documents containing
the information that has to be extracted. Such documents are
provided by users of the system in their daily work, and are
assumed to be text-based. However, they may present some
difficulties, such as including irrelevant content (titles, page
numbers, headings, etc.), being very long, or even containing
overlapping elements as they have been digitized (signatures,
stamps, etc.). The output is the set of specific data extracted
from each of these documents, which is stored along with them as
metadata.


2.2 Information Repositories 


As we can see in Figure 1, we have two main knowledge 
and data repositories in ICIX: the Knowledge Base, and the 
Document Database. 


Knowledge Base: It is the backbone of the rest of the ICIX
architecture, and stores the knowledge which guides the extraction
process. It is an ontology that must model and capture the
following elements:


• Document taxonomy and structure: The different possible
  types of documents are classified hierarchically in a
  taxonomy. Each document class contains information
  about the sections that shape its kind of documents.
  Moreover, each of these sections includes further
  information about which properties and entities have to
  be extracted from it. This information comprises
  different aspects apart from just inclusion (e.g.,
  constraints that the elements extracted from a section
  must hold), making it possible to check the validity of
  the extraction results.

• Entities and extraction methods: The ontology also
  stores which entities have to be obtained, how they
  should be processed, and how they relate to other
  entities. Besides, the key attributes (i.e., the set of
  attributes which defines an entity uniquely) for each
  of the entities are marked as such (e.g., the name,
  the surname, and the national identity document of
  an entity Person). This decoupled knowledge makes
  it possible to easily reuse and adapt the extraction
  operations developed for each entity among different
  types of documents.

• Entity instances: Finally, the ontology also stores
  information about previously extracted and curated data as
  instances of the different entities defined in the model.
  This information is the extensional knowledge that our
  proposed approach uses.

• Configuration data: The ontology also contains the
  information necessary to parametrize other aspects of the
  different systems, thus easing the general configuration
  of the system and its adaptation to different contexts.


Figure 1: Overview of the ICIX architecture. The dotted box limits the information extraction system. Each box is one of the
internal services of the proposed architecture, all of them connected to the knowledge base. Services 1-4 are those that perform
the extraction process itself and 5 and 6 are examples of complementary services that can be coupled to the system. The grey
arrows represent the document data flows, and the white arrows the users’ interaction with useful additional services. To apply
ICIX in different contexts, it is only necessary to modify the Knowledge Base and the Database.

The Knowledge Base is connected to all the processing
modules and, as we will see below, it provides them with
the necessary information to carry out their work.


Document Database: It is the information repository where
the documents are stored. For each document, it stores two
versions of its text: the original one and the cleaned one. Besides,
each document is augmented with different attributes (which
can be document annotations or the results of an extraction):
typology, properties, temporal data, extracted data, etc.
2.3 Document Processing


In Figure 1, we can see the main services that handle the 
documents sent for processing and subsequent storage (ele- 
ments 1-4): 


(1) Pre-process Service: Even when the OCR is perfect,
    spelling problems or noisy words can be found. Moreover,
    the scanned documents can have overlapping
    elements like stamps and signatures, which further
    hinder their automated treatment. To solve these
    problems, we propose including a text cleaning and
    correcting service for automatically recovering data from
    large documents (text correction is a well-known and
    broadly applied task in NLP [20]). The configuration
    parameters of all these tools will be mapped into terms
    in the ontology. The classification service will use the
    text obtained in this pre-processing step as input.

(2) Classification Service: This service automatically
    categorizes the content entered by users, using text
    classification techniques over a set of possible categories
    which are defined in the ontology. To train the
    classification model, we build on annotated documents which
    are obtained by introducing some users in the loop.
    To avoid classification drift, we should retrain and
    redeploy the model as soon as we detect deviations.
    In this module, we must store the training data and the
    different trained models.

(3) Extraction Service: This is the core service of ICIX.
    The previous cleaning and correction steps, as well
    as knowing the typology of the document at hand,
    strongly boost its performance in specific information
    extraction tasks. Once the service knows the type of
    document, it queries the ontology to obtain which
    elements it must look for in the text, and which extraction
    methods must be used to obtain their related information.
    The element-value pairs obtained in this process
    are the basis on which the Data Curation Service works.

(4) Data Curation Service: Once the extraction process is
    completed, our proposal leverages the knowledge stored
    in the domain ontology to: 1) perform a review of the
    extracted data, curating possible errors, and 2) improve
    substantially the quality of the results by correcting
    and enriching them exploiting the available knowledge.


At the end of the last stage, all the information about the
document (i.e., the text obtained in the pre-processing, the
results of the classifier, and the extracted information) is
stored in the document database along with it.
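
The chaining of the four services can be sketched as a simple pipeline. The sketch below is a toy illustration under strong assumptions, not the ICIX implementation: cleaning only removes page-number lines, classification counts keywords, extraction applies regular expressions, and curation removes empty and duplicate values. All names, keywords, and patterns are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    original_text: str
    cleaned_text: str = ""
    doc_type: str = ""
    extracted: dict = field(default_factory=dict)


def preprocess(doc):
    # Toy cleaning: drop standalone page-number lines and collapse whitespace.
    lines = [ln for ln in doc.original_text.splitlines()
             if not re.fullmatch(r"\s*\d+\s*", ln)]
    doc.cleaned_text = " ".join(" ".join(lines).split())
    return doc


def classify(doc, keywords_by_type):
    # Toy classifier: pick the type whose keywords occur most often.
    text = doc.cleaned_text.lower()
    doc.doc_type = max(keywords_by_type,
                       key=lambda t: sum(text.count(k) for k in keywords_by_type[t]))
    return doc


def extract(doc, patterns_by_type):
    # The document type selects which elements to look for and how.
    doc.extracted = {name: re.findall(pattern, doc.cleaned_text)
                     for name, pattern in patterns_by_type[doc.doc_type].items()}
    return doc


def curate(doc):
    # Toy curation: remove empty and duplicate values, preserving order.
    doc.extracted = {name: list(dict.fromkeys(v for v in values if v))
                     for name, values in doc.extracted.items()}
    return doc


def process(text, keywords_by_type, patterns_by_type):
    doc = preprocess(Document(original_text=text))
    doc = classify(doc, keywords_by_type)
    doc = extract(doc, patterns_by_type)
    return curate(doc)


keywords = {"mortgage": ["mortgage", "loan"], "lease": ["lease", "rent"]}
patterns = {"mortgage": {"amount": r"\d+ EUR"}, "lease": {"amount": r"\d+ EUR"}}
result = process("Mortgage deed\n12\nLoan of 1000 EUR and 1000 EUR granted.",
                 keywords, patterns)
```

The `Document` object keeps both the original and the cleaned text, mirroring what the document database stores at the end of the last stage.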


2.4 Other Relevant Services 


For the sake of completeness, we show how our architecture can
be extended to exploit the extracted data by building relevant
services on top of it. In particular, in Figure 1 we can see two
services (modules 5 and 6) which provide specific capabilities
that complement the process in order to, on the one hand,
validate and obtain information using image capture devices,
and on the other hand, help and guide the users:


• Visual Validation & Extraction Service: This service
  deals with the treatment of specific documents used to
  identify/validate the users, such as identity cards,
  driving licenses, etc. This service has been provided with
  the infrastructure and integration necessary to implement
  biometric recognition, combining streaming video
  capture with Artificial Intelligence (AI) techniques to
  identify both people and documents.

• Dialogue Service: As documentation processing systems
  can reach a certain complexity, some mechanism
  of assistance and help to the user is required. ICIX
  handles this issue by taking advantage of the information
  available in its knowledge base and allowing
  conversational agents to be implemented on top of it. Our
  proposal includes such an agent, which can communicate
  with the user in a multimodal way. The responses might
  be in text or video, or directly lead the user to a
  specific page on the Web. The knowledge supporting this
  bot is also represented in the ontology.


3 DOCUMENT PROCESSING: LEGAL 
DOMAIN 


The proposed architecture must be grounded in a particular
domain and system implementation. To validate our proposal,
we have selected the legal domain. Legal documents, such
as notarial and judicial acts, or registration documents, are
often extensive, while the relevant data to be identified for
most document management tasks are often few and
very specific. Given the importance of having precise results,
the identification of these data is still done by hand in many
organizations. Thus, it is a good application scenario for the
ICIX architecture, and it will help us to show the details of
the design of the document processing system.

The AIS System: The architecture presented in the previous
section has guided the implementation of AIS, the
information extraction system integrated in OnCustomer⁴, a
commercial Content Relationship Management (CRM) system
developed by the company InSynergy Consulting⁵. The
input of the system consists of a set of legal documents in
PDF format with unstructured or semi-structured information,
and their type. All the documents have been previously
scanned and processed by ABBYY⁶, an Optical Character
Recognition (OCR) tool. The output is a set of XML files
containing structured data extracted from each input text,
which are stored in a PostgreSQL database. We
have used OpenRefine⁷ for standardizing, validating, and
enriching the extracted data. Regarding the attached systems,
the validation service has been implemented with the tools
provided by Veridas⁸, and the dialogue service has been
implemented as a videobot using Dialogflow⁹. All
these elements are connected with the ontology, which provides
both configuration information and specific data. The
following sections provide the most important details, which
extend and integrate the proposals presented in [4, 5, 17],
to which we refer the interested reader for further details.


3.1 Legal Domain: Document Ontology 


Figure 2 shows an excerpt of the ontology that defines, for 
example, a notarial document: 


• Black boxes represent high level definitions that are
used regardless of the document typology: document,
sections, and entities.

• Gray boxes represent data that are shared among notarial
documents regardless of their particular purpose
(i.e., the introductory section and the appearances).

• White boxes represent distinctive information of this
typology (e.g., the purchased property and its surroundings).

• Dotted boxes represent the extraction methods, and
the standardization and enriching operations, which
are modeled in the ontology as annotations of the
corresponding classes.


Regarding the relations defined in the ontology, straight
arrows represent ‘is-a’ relationships, while dashed arrows
represent the ‘contains’ relation. This model makes it possible
to define a modular knowledge-guided architecture, so new
document types can be added dynamically to perform the
data extraction.


4 http://www.isyc.com/es/soluciones/oncustomer.html
5 http://www.isyc.com
6 https://www.abbyy.com/
7 http://openrefine.org/
8 https://www.das-nano.com/veridas/
9 https://dialogflow.com/

Figure 2: Excerpt of the AIS Ontology showing information related to a notarial deed and references to extraction methods.

Apart from storing the knowledge about the structure of
the documents in the domain, this ontology is extended and
populated with the information stored in the repositories
of real environments where AIS is currently being used (as
well as with previously extracted facts). This is done by
means of a periodic process using R2RML mappings¹⁰, which
basically map each of the entities and properties in the
domain ontology to a set of SQL queries that access the
appropriate data, and allow us to format the data according
to our ontology, updating the knowledge leveraged by our
system.
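
The idea behind such a mapping can be illustrated with a minimal sketch. It is not R2RML syntax and not the AIS implementation: the table `person`, its columns, the `ex:` prefixes and the `MAPPINGS` structure are all hypothetical names chosen for illustration, and an in-memory SQLite database stands in for the production repositories.

```python
import sqlite3

# Hypothetical mapping: ontology class -> SQL query, subject template, column-to-property map.
MAPPINGS = {
    "ex:Person": {
        "query": "SELECT id, first_name, surname FROM person",
        "subject": "ex:person/{id}",
        "properties": {"first_name": "ex:firstName", "surname": "ex:surname"},
    },
}


def rows_to_triples(conn, mappings):
    """Run each mapped query and turn every row into (subject, predicate, object) triples."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    triples = []
    for ontology_class, spec in mappings.items():
        for row in conn.execute(spec["query"]):
            subject = spec["subject"].format(**dict(row))
            triples.append((subject, "rdf:type", ontology_class))
            for column, prop in spec["properties"].items():
                if row[column] is not None:
                    triples.append((subject, prop, row[column]))
    return triples


# Toy repository standing in for a real environment (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, first_name TEXT, surname TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Ana', 'Lopez')")
triples = rows_to_triples(conn, MAPPINGS)
```

Running such a job periodically would refresh the instances available to the later curation steps; a real deployment would rely on an R2RML processor rather than hand-written code.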

We have designed and implemented this ontology as an ontology
network [27]. We chose this methodology because most
documents are classified following a hierarchy with different
features. This model allows us to define a modular
knowledge-guided architecture, so new document types can be added
dynamically. The ontology in our prototype has 25 classes
and 33 properties, and information about 2 chunking procedures
and 17 extraction methods. The extraction methods
used in the experiments are mainly based on symbolic pattern
rules due to the good performance empirically obtained with
this methodology on this type of document.


10 R2RML: RDB to RDF Mapping Language, https://www.w3.org/TR/r2rml/


3.2 AIS Pre-Process Service


As we have presented in Section 2.3, due to several factors,
we need to pre-process the documents to clean the text in
order to improve the extraction performance (noisy text data
and misspelled words hinder the extraction tasks). Thus, in
AIS, we have implemented a text correction service which
comprises three steps to improve the text quality before the
extraction process:


(1) Text Cleaning Filter: The first step of our text curation
process consists of removing non-relevant information
from the input to get the text as clean as possible for
the correction step. Non-relevant elements are usually
page numbers, stamps, noisy characters, etc., which
hinder the extraction process; they are removed
by applying regular expressions, provided
by the ontology, to the text.

(2) Generic Text Correction: The second step is in charge
of correcting the misspelled words which appear in
the text. For this purpose, we advocate applying a
general purpose spell checker. We have used two open
source spell checkers: Aspell¹¹ and JOrtho¹². They
have different features and performance, so we have
combined them to get better data quality.

(3) Specific Text Correction: Finally, we use an N-gram
based spell checker, also generic, but specifically modified
to enhance its performance for the domain of the
documents to be processed. We have used Hunspell¹³,
a powerful spell checker and morphological analyzer
which offers good multi-language support (e.g., English,
French, Spanish, etc.), and we have developed our
own N-gram libraries. For each detected error for a
given word ($w_i$), 1) we get the spell checker suggestions


11 http://aspell.net/
12 http://jortho.sourceforge.net/
13 http://hunspell.github.io/
($s_i$), and 2) we assign a score based on an adaptation
of the Needleman-Wunsch [12] distance and computed
with the actual word, and we reinforce them with the
probabilities of the N-gram suggestions as follows:


$wordScore(w_i, s_i) = g_{SP}(w_i, s_i) \cdot NG(w_i)$


where $g_{SP}$ is the score based on the metric distance,
and $NG$ is the probability of $w_i$ being the next word
in the domain where it appears. Both values are normalized
in the 0-1 range. Note that our system only
accepts words that are suggested by the spell checker,
and gets their probability from the N-gram suggestion
list. Besides, $wordScore$ is never 0 as $g_{SP}(w_i)$ always
returns a value greater than 0, because we add perplexity
to our N-gram model using the add-one Laplace
Smoothing method [15].
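
The scoring step can be sketched as follows. This is a minimal illustration under stated assumptions, not the adaptation used in AIS: the alignment costs (match +1, mismatch -1, gap -1), the linear normalization to the 0-1 range, and the function names are all assumptions, and the N-gram probability of each suggestion is simply passed in as a number.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b (classic dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(word, suggestion):
    """Map the alignment score from [-longest, longest] to [0, 1] (assumed normalization)."""
    longest = max(len(word), len(suggestion)) or 1
    return (needleman_wunsch(word, suggestion) + longest) / (2 * longest)


def word_score(word, suggestion, ngram_probability):
    """Product of the string similarity and the N-gram probability of the suggestion."""
    return similarity(word, suggestion) * ngram_probability


def best_suggestion(word, suggestions, ngram_probabilities):
    """Pick the spell-checker suggestion with the highest combined score."""
    return max(
        suggestions,
        key=lambda s: word_score(word, s, ngram_probabilities.get(s, 0.0)),
    )


# Illustrative values only: a misread word, two suggestions, made-up probabilities.
choice = best_suggestion("Madrld", ["Madrid", "Madras"], {"Madrid": 0.6, "Madras": 0.1})
```

Only words proposed by the spell checker are scored, mirroring the restriction stated above; the N-gram model merely re-ranks them.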


The benefits of using this combined approach are twofold:
on the one hand, the general spell checker allows us to
leverage all the general purpose techniques that are usually
used to perform the corrections; on the other hand, the use
of an N-gram-based model allows us to adapt them to the
particular domain we are tackling by exploiting text regularities
detected in successfully processed domain documents. In
this case, the word sequences which frequently appear in a
particular document typology enable our system to
perform a highly adapted word-level correction for that
kind of document. For example, the sequence "notary of
the Illustrious College of" has a high probability of being
followed by "Madrid" in Spanish notarial documents.
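
A domain N-gram model with add-one (Laplace) smoothing can be sketched as below. The class name, the whitespace tokenization and the three-sentence corpus are assumptions made for illustration; the N-gram libraries developed for AIS are not described in enough detail to reproduce them.

```python
from collections import Counter


class NGramModel:
    """Next-word model over word N-grams with add-one (Laplace) smoothing."""

    def __init__(self, n=3):
        if n < 2:
            raise ValueError("n must be at least 2")
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocabulary = set()

    def train(self, sentences):
        for sentence in sentences:
            tokens = sentence.lower().split()  # naive tokenization, for illustration
            self.vocabulary.update(tokens)
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                self.ngram_counts[ngram] += 1
                self.context_counts[ngram[:-1]] += 1

    def probability(self, context, word):
        """P(word | last n-1 context words); never zero once the model is trained."""
        ctx = tuple(token.lower() for token in context[-(self.n - 1):])
        numerator = self.ngram_counts[ctx + (word.lower(),)] + 1
        denominator = self.context_counts[ctx] + len(self.vocabulary)
        return numerator / denominator


# Made-up corpus echoing the example in the text.
model = NGramModel(n=3)
model.train([
    "notary of the illustrious college of madrid",
    "notary of the illustrious college of madrid",
    "notary of the illustrious college of zaragoza",
])
context = "notary of the illustrious college of".split()
p_madrid = model.probability(context, "Madrid")
p_zaragoza = model.probability(context, "Zaragoza")
p_unseen = model.probability(context, "Toledo")
```

Because of the smoothing term, a word never seen after the context still receives a small non-zero probability, so it can be ranked rather than discarded outright.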


3.3 AIS Classification Service 


In our implementation of the text classification service, to 
model the document text, we have adopted a continuous dis- 
tributed vector representation for text fragments of variable 
length, from a sentence to a large document. These vectors, 
called Paragraph Vectors, are obtained through an unsuper- 
vised learning algorithm that learns sequence representations 
adopting the task of predicting words inside or within neigh- 
bouring sequences [18]. Using this technique, each document 
is mapped into a single vector. Besides, as a by-product, a 
distributed vector representation of the vocabulary in the 
domain is obtained, adapted to the particularities of the 
language use in the corpus of documents. 

After learning the domain vocabulary and document vector
representations/models, we classify each document by
means of the k-nearest neighbors algorithm (k-NN), where the
closest documents vote for the actual class of a given one (in
this case, we use the cosine distance between document
representations, and a k value of just one). Note that once we
have enough documents to obtain an initial document
representation space, we do not need to retrain it from scratch,
as this approach allows us to obtain representations of new
documents without having to do so. The annotations required
to train the document classifier are obtained by asking some
particular users to help annotate some seed documents.
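
The classification rule (cosine distance, k = 1) can be sketched as follows. The document vectors are assumed to be already available; the three-dimensional toy vectors and the class labels are invented for illustration, and learning the Paragraph Vectors themselves is outside the scope of the sketch.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def classify_1nn(vector, labelled_vectors):
    """Return the label of the annotated document closest to `vector`."""
    nearest_label, _ = min(
        ((label, cosine_distance(vector, known)) for label, known in labelled_vectors),
        key=lambda pair: pair[1],
    )
    return nearest_label


# Hypothetical seed documents annotated by users (toy vectors).
seed_documents = [
    ("notarial_deed", [0.9, 0.1, 0.0]),
    ("judicial_act", [0.1, 0.8, 0.2]),
    ("registration", [0.0, 0.2, 0.9]),
]
predicted = classify_1nn([0.8, 0.2, 0.1], seed_documents)
```

With k = 1 no voting is needed: the new document simply inherits the class of its nearest annotated neighbor.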
3.4 AIS Extraction Service


The data extraction process is guided by the information 
represented in the ontology and it follows two steps: 1) Sec- 
tioning, to divide the text into different sections that contain 
valuable information according to the stored knowledge in 
the ontology, and 2) Extraction, to extract entities from each 
section also according to the stored knowledge. Sections and 
entities are both obtained by applying extraction operations 
over the input text. 

As shown in Figure 2, the ontology allows us to specify
the extraction methods to be applied depending on the domain
and the document typology. Internally, the operations are
defined as tuples $R = \langle T, C, P \rangle$, where $T$ is the type of
the operation, $C$ is the context of the operation, i.e., a set
of variables $v_1, v_2, \ldots, v_N$ that the previous operations have
initialized to define the current state of the data extraction
task, and $P$ is a list of parameters $p_1, p_2, \ldots, p_M$ of the rule,
with $M > 0$. This allows us to express small IE programs which
can be easily reused. These operations loosely follow the line
of the Data Extraction Language (DEL)¹⁴, but in our case
they are oriented to natural language. In more detail, the
operations are the following:


• Segmentation: Let $text_i$ be the input text; this operation
extracts the set $\{seg_j\}$, i.e., the set of segments
which make up the input text. Such segments are defined
by two regular expressions, $p_1 = regExp_{start}$ and
$p_2 = regExp_{end}$, which mark the start and end of a
particular segment. If one of $regExp_{start}$ or $regExp_{end}$
is empty, the segment $seg_j$ is defined by
$[regExp_i, regExp_{i+1}]$ or $[regExp_{i-1}, regExp_i]$. This operation
also provides different policies to deal
with isolated matches, e.g., if the matching sequence
$\{start_1, end_1, end_2\}$ is found, the service can choose
between forming a segment $[start_1, end_1]$ or extending
it to $[start_1, end_2]$.

• Selection: Let $text_i$ be the input text; this operation
extracts the set $\{text_{ij}\}$, i.e., the set of all subtexts
of $text_i$ that match the list of regular expressions
$\{regExp_{begin}, regExp_{inside1}, \ldots, regExp_{insideN}\}$
defining an IOB format¹⁵.

• Replacement: Let $text_i$ be the input text, $regExp_R$ a
regular expression, and $str$ a string literal. This operation
returns the output text in which all the
$regExp_R$ matches on $text_i$ are replaced by the $str$ literal.

• Variable manipulation: Let $text_i$ be the input text, $C_i$
the context of the extraction process before operation $i$,
$regExp_C$ a regular expression, and $n$ a natural number.
This operation extracts the set $\{text_{ij}\}$ of all
subtexts of $text_i$ that match the regular expression,
and binds the result of $size(\{text_{ij}\}) > n$ to the next
context $C_{i+1}$.


14 https://www.w3.org/TR/data-extraction

15 Inside Outside Beginning (IOB) format is used for tagging tokens in
a chunking task in computational linguistics. I indicates that a tag is
inside a chunk, O indicates that a token belongs to no chunk, and B
indicates that a tag is the beginning of a chunk.
• Flow control: Let $text_i$ be the input text, $C_i$ the context
of the extraction process before operation $i$, $\{L_j\}$ a list
of sets of operations, and $\{Cond_j\}$ a list of either regular
expressions $regExp_j$ or variables $v_{ij}$ defined inside $C_i$.
This operation evaluates $\{Cond_j\}$ sequentially until
either a match of $regExp_j$ inside $text_i$ or the value of
$v_{ij}$ returns true. Then, it applies the list of operations
$L_j$ to the input text.
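
Three of the operations above can be sketched with plain regular expressions. This is an illustrative reading of the descriptions, not the AIS code: the function names, the `extend` flag used to model the isolated-match policy, the `name` parameter of the variable binding, and the sample text are all assumptions.

```python
import re


def segment(text, start_pattern, end_pattern, extend=False):
    """Segmentation: cut `text` into pieces running from a start match to an end match.

    With extend=False a segment closes at the first end match after its start;
    with extend=True it stretches to the last end match before the next start.
    """
    starts = [m.start() for m in re.finditer(start_pattern, text)]
    ends = [m.end() for m in re.finditer(end_pattern, text)]
    segments = []
    for i, begin in enumerate(starts):
        limit = starts[i + 1] if i + 1 < len(starts) else len(text)
        candidates = [e for e in ends if begin < e <= limit]
        if not candidates:
            continue  # isolated start with no end before the next start
        stop = candidates[-1] if extend else candidates[0]
        segments.append(text[begin:stop])
    return segments


def replace_matches(text, pattern, literal):
    """Replacement: substitute every match of `pattern` with the literal string."""
    return re.sub(pattern, lambda _match: literal, text)


def bind_variable(text, pattern, n, context, name):
    """Variable manipulation: store whether more than n matches exist in a new context."""
    next_context = dict(context)
    next_context[name] = len(re.findall(pattern, text)) > n
    return next_context


# Toy input with one isolated extra end marker after the first segment.
sample = "BEGIN a END b END BEGIN c END"
short_segments = segment(sample, r"BEGIN", r"END")
long_segments = segment(sample, r"BEGIN", r"END", extend=True)
```

Returning a fresh context from `bind_variable` keeps each operation free of side effects, which makes small operation sequences easy to reuse; whether AIS does the same is not stated in the text.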


In addition, this service is able to manage regular expressions
extended with NLP analysis, which are used to
enhance the operations. These extended regular expressions
are defined in a similar way as in NLTK¹⁶, i.e., as a list of
regular subexpressions $regExp_{EXP} = \{subRegExp_i\}$, each
$subRegExp_i$ being either a classical regular expression or a custom
regular expression which handles morphological analysis. We
extend the capabilities of this process by integrating a Named
Entity Recognizer (NER) [21] to identify mentions of entities
in the text. Regarding the NER tool, we use Freeling¹⁷, a
well known and widely used analysis tool suite that supports
several analysis services in both Spanish and English,
as well as other languages which could be incorporated in our
architecture in future developments.


3.5 Data Curation Service 


Finally, after the extraction process, AIS curates the extracted
data to improve the outcome of the overall information
extraction process. This whole process is guided by
the knowledge stored in the domain ontology, analyzing the
extracted entities, their structure and their attributes. In
particular, it performs the following steps:


(1) Basic Cleaning: First, the process cleans empty fields
and duplicated information. We dedicate the first step
to these two types of errors due to their frequency and
their particularly harmful effects in subsequent steps.

(2) Data Refinement: This step is focused on refining those
entity attributes whose content is corrupted, e.g., the
first name of a person contains the full name of one or
more different persons. For each attribute of each entity,
this step searches the instances stored in the ontology;
then, if it detects that an attribute is not valid, it
replaces the attribute with the appropriate value stored
in the ontology. If it does not find an equivalent value,
the process does not change it.

(3) Validation and Enrichment: In this step, the data is
standardized, validated, and enriched by exploiting different
sources of information. Currently, these sources
mainly include: 1) previously extracted and verified
data included as instances in the ontology, and 2) available
external information services (e.g., Google Maps
API¹⁸). We apply this step to standardize and validate
the value of the attributes of the entities with the goal
of resolving duplicates later. For example, the type of


16 A toolkit for building Python programs to work with natural language.
http://www.nltk.org

17 http://nlp.lsi.upc.edu/freeling/

18 https://developers.google.com/maps/documentation/geocoding/start
a street can be extracted as ‘St.’ or ‘Street’, but both
represent the same concept; or the service could
extract from a certain postal address only the street
and the city, but the zip code could be obtained from
Google Maps.

(4) Entity Division: Once the data is validated and enriched,
this step: 1) detects attributes which belong to
different entities but are assigned to a single extracted
entity, and 2) separates such attributes, assigning them
to the appropriate ones. The detection is focused on
the set of attributes that uniquely defines an entity
(i.e., their key attributes), e.g., the data extracted for
a person could include two driving license numbers. In
order to detect potentially unified entities, the process
searches entities in the ontology using combined key
attributes. In case of success, it creates a new entity
with such data, and deletes those attributes from the
rest of the entities. If no entity is found or the attributes of
another entity remain in the result, the process creates
a number of entities equal to the maximum number
of occurrences of a key attribute, and each attribute is
assigned to an entity. For this purpose, our approach
uses an Entity Aligner module which includes different
functions and methods to measure the similarity
between two entities [2, 10, 13].

(5) Final Review: We include this last step to remove
those attributes or information which remain at the
end of the process and which have not been assigned to
any entity or validated, or do not cover the minimum
cardinality specified in the ontology.
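
Some of the curation steps above can be sketched on a toy entity representation (attribute name mapped to a list of extracted values). Everything here is an assumption made for illustration: the `STREET_TYPES` table, the function names, and in particular the positional assignment used in `divide_entity`, which stands in for the similarity-based Entity Aligner that the text mentions but does not detail.

```python
# Hypothetical normalization table for street types.
STREET_TYPES = {"st.": "Street", "st": "Street", "street": "Street"}


def basic_cleaning(entity):
    """Step 1: drop empty values and duplicates, keeping first-seen order."""
    cleaned = {}
    for attribute, values in entity.items():
        unique = []
        for value in values:
            value = value.strip()
            if value and value not in unique:
                unique.append(value)
        if unique:
            cleaned[attribute] = unique
    return cleaned


def standardize_street_type(value):
    """Step 3 (part): map spelling variants of a street type to one canonical form."""
    return STREET_TYPES.get(value.strip().lower(), value)


def divide_entity(entity, key_attributes):
    """Step 4 (fallback case): split an entity that holds several key values.

    Creates as many entities as the maximum number of occurrences of a key
    attribute and assigns values by position (a simplification).
    """
    count = max((len(entity.get(key, [])) for key in key_attributes), default=0)
    if count <= 1:
        return [entity]
    parts = [{} for _ in range(count)]
    for attribute, values in entity.items():
        for i, value in enumerate(values):
            parts[min(i, count - 1)].setdefault(attribute, []).append(value)
    return parts


# Toy extracted person with a duplicate, an empty value and two license numbers.
person = {"name": ["Ana Lopez", "Ana Lopez", " "], "license": ["A-1", "B-2"]}
cleaned = basic_cleaning(person)
divided = divide_entity(cleaned, ["license"])
```

In this example the second entity keeps only the orphan license number; a final review step would then decide whether it meets the minimum cardinality required by the ontology.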
Our approach makes intensive use of the information stored in
the domain ontology to support the data curation process in
different steps, relying on its already validated information
and trusted information services to improve the quality of
the extracted data. Note how this approach could be easily
adapted to other domains whenever there is available curated
information to be leveraged.


4 RESULTS 


To evaluate the whole pipeline, we carried out a batch of
experiments focused on testing the overall performance of
our IE approach. For this purpose, we used a dataset of 250
documents within the legal domain, from which we extracted
the data by hand to obtain the baseline. To measure the
results, we used the well-known classic measures in IE to
test the accuracy of a system: Recall (R), Precision (P), and
F-measure. In Table 1, we show the results obtained in terms
of Macro and Micro values:


• Micro: The micro values show the accuracy taking into
account all the possible data to be extracted, making
no distinction among the different entities which the
data belongs to. As we can see in Table 1, we obtained
a value of 0.82 in F-measure.

• Macro: The macro values are calculated by first grouping
the accuracy values per entity and then averaging them.
This gives a more detailed view of all the
extracted entities and their values. Moreover, in some
scenarios, a little noise in the extracted data is acceptable,
and we wanted to show how our approach behaves. For
this purpose, we used three values depending on a
similarity threshold (75%, 90% and 100%). This similarity
is computed using the Needleman-Wunsch [12] metric,
which allows us to compare the degree of correctness of
an extracted entity. In Table 1, we can see the results
obtained for each threshold.


Table 1: Performance of the information extraction approach

Type    Similarity   Precision   Recall   F-measure
Micro   -            0.73        0.94     0.82
Macro   75%          0.82        0.84     0.83
Macro   90%          0.79        0.82     0.80
Macro   100%         0.71        0.73     0.72
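
The F-measure values reported in Table 1 are consistent with the harmonic mean of the reported precision and recall, which the following sketch recomputes. The micro and macro averaging helpers illustrate the difference between the two aggregation schemes on invented per-entity counts (true positives, false positives, false negatives); those counts are not taken from the experiments.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F-measure) as given in Table 1.
TABLE_1 = {
    "Micro": (0.73, 0.94, 0.82),
    "Macro 75%": (0.82, 0.84, 0.83),
    "Macro 90%": (0.79, 0.82, 0.80),
    "Macro 100%": (0.71, 0.73, 0.72),
}
recomputed = {name: round(f_measure(p, r), 2) for name, (p, r, _) in TABLE_1.items()}


def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_average(counts):
    """Pool the counts of all entity types, then compute precision and recall once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall(tp, fp, fn)


def macro_average(counts):
    """Compute precision and recall per entity type, then average them."""
    pairs = [precision_recall(*c) for c in counts.values()]
    return (
        sum(p for p, _ in pairs) / len(pairs),
        sum(r for _, r in pairs) / len(pairs),
    )


# Hypothetical counts per entity type: (tp, fp, fn).
toy_counts = {"person": (8, 2, 0), "property": (1, 1, 3)}
micro_p, micro_r = micro_average(toy_counts)
macro_p, macro_r = macro_average(toy_counts)
```

On the toy counts the macro values are pulled down by the poorly extracted entity type, whereas the micro values are dominated by the frequent one, which is why the two views are reported separately.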


These results are obtained using the whole pipeline. In a 
more detailed analysis of each step, we measured that: 


• The document pre-processing improves the hit
and miss results by about 3%, a slight improvement,
but we have to take into account that we work on top
of general spell checkers that have been thoroughly
trained and tested (they are really robust). Regarding false
positives and negatives, the N-gram model blocks corrections
that are wrongly proposed by spell checkers, achieving
a reduction of 13% in false positives at the cost of an
increase of just 0.5% in false negatives, which can
be considered a good result.

• Applying the classification service and the post-processing
guided by the ontology, we obtained an improvement
in F-measure of about 12%.


5 RELATED WORK 


Information extraction (IE) is the process of scanning text 
for obtaining specific and relevant information, including 
extracting entities, relations, and events. Within the broad 
spectrum of approaches to this generic problem, we could 
situate our work as a rule-based system guided by an ontology. 

The use of ontologies in the field of Information Extraction
has increased in recent years. Thanks to their expressiveness,
they are successfully used to model human knowledge and to
implement intelligent systems. Systems that are based on the
use of ontologies for information extraction are called OBIE
systems (Ontology Based Information Extraction) [31]. The
use of an ontological model as a guideline for the extraction
of information from texts has been successfully applied in
many other works [3, 16].

While statistical machine learning is widely used as a
black-box technology regarding transparency of decisions,
rule-based approaches follow a mostly declarative approach
leading to explainable and expressive models. Other advantages
of such rule languages include readability, maintainability
and, foremost, the possibility of directly transferring
domain knowledge into rules [30]. The potential of directly
including the knowledge of a domain expert into the IE process,
rather than choosing the most promising training data,
hyper-parameters or weights, is a major advantage compared
to other approaches to IE.

To the best of our knowledge, there are no other works
focused on creating a generic architecture/methodology dedicated
to the specific extraction of information from documents.
Besides, it is hard to locate information about extraction
systems equipped with extraction methods that are robust in the
presence of OCR problems due to overlapping marks. Focusing
on particular domains, in recent years we can find some
works in the legal [6, 8, 9, 23], industrial [19, 25, 26],
and medical domains [22, 29], in which the use of ontologies
and rules can be deemed similar to ours. The
difference between these aforementioned works and ours is
threefold: 1) we aim at a flexible architecture adaptable to
any domain, 2) our proposed ontological model captures both
the nature of the entities and the formal structure of documents
in order to boost the extraction process, and 3) ICIX
performs an integral improvement of the extraction results by
providing a complete end-to-end pipeline: previous cleaning
and correction processes, use of automatic classifiers enabling
the suitable extraction tools, the post-verification process, as
well as the presence of satellite validation and help tools.


6 CONCLUSIONS 


In this paper, we have presented a generic architecture to 
carry out an automatic data extraction from documents. Our 
extraction process is guided by the modeled knowledge about 
the structure and content of the documents. This knowledge 
is captured in an ontology, which also incorporates infor- 
mation about the extraction methods to be applied in each 
section of the document. The service-oriented architecture 
approach we have adopted makes our architecture flexible 
enough to incorporate different techniques to perform data 
extraction via invoked methods. The ontology has been upgraded
by storing additional domain knowledge in order to
detect and solve consistency errors in the extracted data.
Besides, we have endowed the architecture with a pre-process
service that aims to correct OCR errors and minimize
the presence of annoying overlapping elements in the document,
and with an automatic classifier which optimizes the choice of
extraction methods to use.

The knowledge-based nature of our approach, as it is
completely guided by the domain ontology, enhances its
flexibility and adaptability, and ensures its portability to
other use cases. In particular, its application to other business
contexts would only require adapting the data and knowledge
expressed by the ontology; the rest of the system would
remain unchanged.

To show its flexibility and feasibility, we have detailed the
AIS system, the implementation of our architecture for the
legal domain. The preliminary experiments we have carried
out suggest that exploiting knowledge to guide the extraction
process improves the quality of the results, yielding
good results regarding precision and efficiency. Despite
the fact that the experimental dataset is composed of Spanish
legal documents, our approach is generic enough to be
applied to documents in other languages. The good results
achieved so far lead to optimism regarding the interest of
this architecture.

As future work, we want to apply our approach to other
domains apart from the legal one, extending the experiments to
other fields. Regarding the architecture, we want to explore
the incorporation of external web services into the platform
to expand the functionalities of the approach.
ABSTRACT


Data Science as a multidisciplinary discipline has seen a massive 
transformation in the direction of operationalisation of analysis 
workflows. Yet it can be observed that such a workflow consists
of potentially many diverse components, such as modules in different
programming languages, database backends, or web frontends. In
order to achieve high efficiency and reproducibility of the analysis, 
a sufficiently high level of software engineering for the different 
components as well as an overall software architecture that inte- 
grates and automates the different components is needed. For the 
use case of gene expression analysis, from a software quality point 
of view we analyze a newly developed web application that allows 
user-friendly access to the underlying workflow. 


CCS CONCEPTS 


• Software and its engineering → Requirements analysis; Maintaining
software; • Applied computing → Computational transcriptomics;
Bioinformatics;


KEYWORDS 


Data Science workflow, Gene expression analysis, Software quality, 
Web service 
1 INTRODUCTION


Data has evolved to become one of the most important assets for 
any business or institution in recent years and constitutes one of 
the key factors driving current innovations. For example, the steep
spike in technologies for personalized medicine enabled by wide
applications of genome or transcriptome analysis is facilitated
by a constant collection of (big) data. Workflows in a Data Science
context refer to the processing of data from its raw form to the 
finished interpretation and result visualization. Such a workflow 
includes steps such as cleaning, feature engineering, model genera- 
tion, and validation. In order to fully operationalize the workflow, 
this process may also include the deployment of the workflow in 
an interactive webservice. 

As a typical use case for an analysis workflow we address gene 
expression analysis. Gene expression analyses are important in 
areas such as drug development and tumor research. The challenge
of analyzing genome and transcriptome data volumes has
become increasingly important in recent years and covers aspects
of exploratory data analysis, statistics and machine learning. Since
scientists performing gene expression experiments usually lack sta- 
tistical and computer science knowledge to handle those amounts 
of data, usable interfaces are needed. To enable an easy analysis of 
the data, we designed and implemented a web service that abstracts 
from the mathematical details and any programming tasks. The web 
interface was implemented using the R Shiny framework which 
was chosen because R libraries are widely used for gene expres- 
sion analysis. Yet so far the R Shiny framework itself has not been 
thoroughly evaluated regarding its capability for implementing 
complex data science applications. 

As the main goal of this article we provide an evaluation of our 
web application in terms of software quality. More precisely, our 
focus lies on comparing the maintainability of modularized versus 
non-modularized Shiny applications as exemplified by our gene 
expression web application. 

A minor secondary goal of the article is to cover the combina- 
tion of aspects with respect to workflows and DevOps by briefly 
describing an implementation of an IT system infrastructure that 
serves as our basis for a flexible deployment of the data analysis 
workflows and web applications. We believe that a consideration
of DevOps paradigms can simplify development and operations of
typical Data Science workflows and enables high quality in terms
of reproducibility, reusability and automation to be achieved with low
maintenance effort.


Outline. The paper is structured as follows. Section 2 describes 
the gene expression analysis workflow as a use case. Section 3 in- 
troduces the topic of software quality. Section 4 summarizes our 
requirements analysis and presents related work. Section 5 dis- 
cusses our approach towards modularity of our application. Section 
6 presents general design principles and analyzes their analogies 
in Shiny. Section 7 gives an in-depth analysis of our application 
based on quality metrics. Last but not least, Section 8 touches upon 
aspects of a real-world deployment before concluding the article in
Section 9.


2 USE CASE: GENE EXPRESSION ANALYSIS 


Gene expression is defined as the process of using the sequence of 
nucleotides of a gene to synthesize a ribonucleic acid (RNA) mole- 
cule, which in turn triggers the synthesis of a protein or performs 
some other biological function in a cell. As a result of this process, 
the nucleotides’ sequence determines the biological information of 
a gene [6]. 


2.1 Data Format 


We focus on the development of a web service to analyze gene expression
data generated with Affymetrix microarray chips. Microarrays
are a collection of genes or cDNAs arranged on a glass or silicon 
chip [4]. DNA molecules are located on spots or features on the 
microarray chips’ silicon surface. On each feature are a few million 
copies of the same section of a DNA molecule. This section can be 
associated with a specific gene and is in Affymetrix chips composed 
of 25 nucleotides. cDNA molecules get labeled with a fluorescent 
dye and form the target molecule of the experiment. Finally, the 
intensity of the fluorescence of the individual spots is measured 
using special lasers. The more the oligonucleotides in the spots are 
hybridized, the stronger the emission of light will be. An image is 
generated from the intensity of the individual spot’s fluorescence 
and stored for further analysis. The generated image is kept in the 
data (DAT) file format, which contains the measured pixel intensi- 
ties as unsigned integers. A so-called CEL file summarizes the DAT 
file and is used in the further analysis. 

Beyond the measurement data, additional metadata annotations 
have to be stored. The MicroArray Gene Expression Tabular (MAGE- 
TAB) specification is a standard format for annotating microarray 
data, which meets MIAME requirements. MIAME is a recommen- 
dation of the Microarray Gene Expression Database society. It is 
a list of information that should be provided at a minimum when 
microarray data are published to allow description and reproduction [5].
MAGE-TAB defines four different file types: Investigation Descrip- 
tion Format; Array Design Format; Sample and Data Relationship 
Format (SDRF); raw and processed data files. For the subsequent 
analysis, the SDRF file is of particular importance because it
contains information about the relationships between the samples,
arrays (as CEL files), and data of the experiment.
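
The role of the SDRF file can be illustrated with a minimal sketch. It is written in Python purely for illustration (the web service itself is implemented in R), and the sample names, the factor "treatment" and the file contents are hypothetical; only the tab-delimited layout follows the MAGE-TAB idea of relating samples to their CEL files.

```python
import csv
import io

# Hypothetical SDRF excerpt; sample names, factor and values are made up.
SDRF_TEXT = (
    "Source Name\tArray Data File\tFactor Value[treatment]\n"
    "sample_1\tsample_1.CEL\tcontrol\n"
    "sample_2\tsample_2.CEL\ttreated\n"
    "sample_3\tsample_3.CEL\ttreated\n"
)


def read_sdrf(text):
    """Parse tab-delimited SDRF text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def cel_files_by_factor(rows, factor_column, file_column="Array Data File"):
    """Group CEL file names by the value of one factor column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(SDRF_TEXT)
groups = cel_files_by_factor(rows, "Factor Value[treatment]")
print(groups)
```

Grouping the CEL files by a factor value in this way is what later allows biological groups to be compared in the DEG analysis.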


2.2 Workflow


The data analysis for a gene expression experiment is usually con- 
ducted by following a specific workflow. After performing a gene 
expression experiment, first of all the quality of the resulting data 
must be verified using different metrics. If the quality is insufficient, 
some data may need to be removed from the analysis or parts of 
the experiment may need to be repeated [13]. If the quality of the 
data is sufficient, the data needs to be preprocessed. This process 
consists of three steps: Background correction to remove influences 
on the data that have no biological cause; normalization to make 
individual samples comparable; and summarization to calculate one 
value from multiple measured values of the activity of the same 
gene [13]. Finally, the actual statistical data analysis is performed
to answer the research question [15].


Figure 1: Workflow for analyzing differentially expressed
genes (DEGs)
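
The three preprocessing steps can be sketched as follows. This is a toy illustration in Python under strong simplifying assumptions: a constant background estimate, median centering of log2 intensities as normalization, and the median as summary statistic. These are stand-ins chosen for brevity and are not the algorithms applied by the web service; all names and numbers are hypothetical.

```python
import math
from statistics import median


def background_correct(intensities, background):
    """Subtract a background estimate, keeping values at least 1.0."""
    return [max(value - background, 1.0) for value in intensities]


def normalize(samples):
    """Median-center log2 intensities so that samples become comparable."""
    normalized = {}
    for name, values in samples.items():
        logged = [math.log2(value) for value in values]
        center = median(logged)
        normalized[name] = [value - center for value in logged]
    return normalized


def summarize(values, probe_to_gene):
    """Collapse probe-level values to one median value per gene."""
    per_gene = {}
    for value, gene in zip(values, probe_to_gene):
        per_gene.setdefault(gene, []).append(value)
    return {gene: median(vals) for gene, vals in per_gene.items()}


# Hypothetical raw intensities: four probes, two probes per gene.
raw = {"s1": [5, 9, 17, 33], "s2": [9, 17, 33, 65]}
probe_to_gene = ["geneA", "geneA", "geneB", "geneB"]

corrected = {name: background_correct(v, background=1.0) for name, v in raw.items()}
normalized = normalize(corrected)
result = {name: summarize(v, probe_to_gene) for name, v in normalized.items()}
print(result)
```

In this example the second sample is uniformly brighter than the first; after normalization both samples yield the same per-gene values, which is the purpose of making samples comparable.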

We now present the workflow from the user’s point of view by
detailing all steps followed by a user in our web application. These
steps are initially derived from the MaEndToEnd workflow [17], yet
several customizations had to be added as requested by the end users
of our web service. Figure 1 visualizes the basic steps: data
upload, preprocessing, DEG analysis, DEG comparison and gene
ontology (GO) analysis.


(1) The workflow starts with the upload of the raw input files.
For this, the user has to define the source of the files. Currently,
the local file system, an SQL database connection and the GEO
repository [10] are available as options. It is possible to filter
the samples from the input data since not all samples may be of
interest.

(2) As the first step of the preprocessing, the uploaded files 
can be filtered for their so-called gene chip types. An all-in- 
one pre-processing algorithm can be selected by the user, 
which is then applied to the data. It performs the steps of 
background correction, normalization, and summarization. 
Afterwards, the user can plot the data in various ways for 
quality control. 

(3) Next follow the actual gene expression analyses. A histogram
of the distribution of the p-values is plotted for each analysis,
where it is expected that the frequency is very high near 0 and
low towards 1. The user verifies
this. If the histogram does not meet the expectations, the 
workflow ends at this point. Possible faults could be incor- 
rect data, incorrect pre-processing or improperly defined 
analyses. If the histogram meets the expectations, the next 
step is to set the significance thresholds. These define ranges 
of statistical values calculated during the analysis in order 
to identify differentially expressed genes (DEGs). 

(4) In the next step of the workflow, it is possible to compare
the performed gene analyses with each other. For the comparison,
different plots and Venn diagrams should be created automatically.

(5) For classifying the identified differentially expressed genes
(DEGs) in the biological context, gene ontology (GO) [7] based
enrichment analyses are performed in our web service. This offers
a uniform vocabulary, which applies to all eukaryotes.

Our web application offers a separate pane for each of the above 
workflow steps. 
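
The check of the p-value histogram and the subsequent thresholding in the DEG analysis step can be sketched as follows. The Python code is a hedged illustration only: in the application the user inspects the histogram visually, so the automatic check below is an assumed heuristic, and the choice of an adjusted p-value together with a log fold change as thresholded statistics, as well as all gene names and numbers, is hypothetical.

```python
def pvalue_histogram(pvalues, bins=10):
    """Count p-values in equally wide bins covering the interval [0, 1]."""
    counts = [0] * bins
    for p in pvalues:
        index = min(int(p * bins), bins - 1)  # p == 1.0 goes into the last bin
        counts[index] += 1
    return counts


def looks_as_expected(counts):
    """Assumed heuristic: the bin nearest 0 is larger than every other bin."""
    return all(counts[0] > count for count in counts[1:])


def select_degs(results, max_adj_p=0.05, min_abs_logfc=1.0):
    """Return genes whose statistics lie inside the significance thresholds."""
    return [
        gene
        for gene, (adj_p, logfc) in results.items()
        if adj_p <= max_adj_p and abs(logfc) >= min_abs_logfc
    ]


# Hypothetical analysis output: gene -> (adjusted p-value, log fold change).
results = {
    "geneA": (0.01, 2.3),
    "geneB": (0.20, 3.0),
    "geneC": (0.03, -1.5),
    "geneD": (0.04, 0.2),
}
pvalues = [0.001, 0.004, 0.02, 0.03, 0.25, 0.45, 0.7, 0.95]

counts = pvalue_histogram(pvalues, bins=5)
degs = select_degs(results) if looks_as_expected(counts) else []
print(counts, degs)
```

If the histogram does not show the expected peak near 0, no DEGs are selected, mirroring the point at which the workflow ends in step (3).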


3 SOFTWARE QUALITY 


As already described, the importance of data analysis applications is 
increasing. Whether a framework is suitable for the development of 
complex applications depends on many factors. We aim to analyse
our web service according to standardized software quality metrics. 
The term software quality is defined by the International Organiza- 
tion for Standardization/International Electrotechnical Commission 
(ISO/IEC) as the “degree to which a software product satisfies stated 
and implied needs when used under specified conditions” [16]; see 
also [12] for a recent survey. The ISO/IEC 25000 series is
an international standard entitled “Systems and software engineer- 
ing — Systems and software Quality Requirements and Evaluation 
(SQuaRE)” including a section on System and software quality mod- 
els. A distinction is made between the following quality models: 
Quality in Use and Product Quality. We will focus here on Product 
Quality that is defined as “characteristics [...] that relate to static 
properties of software and dynamic properties of the computer 
system” [16]. 

We now introduce a selection of these characteristics that we 
later consider in the assessment of our software. 


(1) Functional suitability consists of the sub characteristics func- 
tional completeness, appropriateness, and correctness. It 
addresses the effectiveness of the software product and is 
assessed for our software by a comprehensive requirements 
analysis and subsequent evaluation in user tests. 

(2) Usability includes appropriateness recognizability, learnabil- 
ity, operability, user error protection, UI aesthetics, and acces- 
sibility. Thereof, appropriateness recognizability is defined 
as “the degree to which users can recognize whether a prod- 
uct or system is appropriate for their needs”. Usability of 
our software is also evaluated in user tests and feedback 
interviews. 

(3) Maintainability is defined as “the degree of effectiveness and 
efficiency with which a product or system can be modified 
by the intended maintainers”. It includes the sub characteris- 
tics modularity, reusability, analyzability, modifiability, and 
testability. The sub characteristics are explained below
because they are considered in more detail when we later evaluate
our Shiny application with quantitative metrics.

(a) Modularity describes the separation of the software into 
components, where changes in a component should affect 
dependent components as little as possible. 

(b) Reusability means the use of assets in several software
systems or in building other assets. Assets are defined as
“work products such as requirements documents, source 
code modules, measurement definitions, etc.” [16]. 

(c) The sub characteristic analyzability describes the under- 
standing of the interrelations in the software product, for 
example, the impact a change would have. 

(d) Modifiability describes the extent to which software can be 
changed without degrading quality or introducing errors. 
According to ISO/IEC, the modifiability can be impacted 
by modularity and analyzability. 

(e) Finally, testability describes the possibilities of setting up 
test criteria and verifying their compliance. 

(4) Compatibility describes, with the sub characteristic
co-existence, the impact on other software running on the same
platform and, with the sub characteristic interoperability, the
exchange of information with other software. These aspects are
considered in our software since input and output file formats are
supposed to be interchangeable with other software products by
using standardized data formats, allowing access to public
repositories (like Gene Expression Omnibus) as well as offering a
versatile database system access.


We mention in passing that the Product Quality model [16] 
also covers other characteristics which are however out of the 
scope of this paper. The characteristic performance efficiency
describes the properties time behavior, resource utilization and
capacity and depends on the platform or environment used. Reliability
describes the maturity, availability, fault tolerance and recoverabil- 
ity of the software. Security includes the confidentiality, integrity, 
non-repudiation, accountability and authenticity of the software, 
which will not be considered in this article since the application is 
only accessible to researchers inside the institute. Finally, portabil- 
ity includes the adaptability, installability and replaceability of the 
product; while not quantitatively analyzing portability, we address 
this issue in Section 8. 


4 REQUIREMENTS ANALYSIS AND 
COMPARISON TO RELATED WORK 


For the implementation of the web service, the software
requirements have to be specified first. As already mentioned,
various requirements were derived from the MaEndToEnd workflow.
Throughout the entire development process, further requirements
resulting from a literature search and interviews with future users
of the web service were identified and documented.

A distinction between functional and non-functional require- 
ments is made. Examples of non-functional requirements are the 
system’s availability or safety aspects. The requirements are graded 
into high, medium and low concerning their relevance for imple- 
mentation [24]. We identified 30 functional requirements and 4 
non-functional requirements (which cannot be fully reproduced 
here in detail due to space restrictions); out of the functional re- 
quirements 13 and out of the non-functional requirements 2 were 
graded as high. 

Because of the relevance of gene expression analyses, many tools 
for those are already available. For justifying the development of a 
new tool, the following paragraphs will give a short presentation 
of a selection of existing tools and show the differences between 
the defined requirements and the functionalities of these tools. The 
Transcriptome Analysis Console (TAC) software [1] is presented 
because Fraunhofer ITEM researchers currently use it for gene 
expression analyses. The other presented tools were selected based 
on a literature search; two of them were chosen because their func- 
tionalities and the defined requirements intersected as closely as 
possible: GEO2R [20] and NetworkAnalyst [25]. In particular, only
tools that are open source and offer a graphical UI were selected. 

The TAC software [1] is provided by Affymetrix Inc., the manu- 
facturing company of the microarray chips. The gene expression 
analyses are mostly performed with the R packages provided by 
Bioconductor [14]. According to our requirements analysis we
identified the following shortcomings of the software: Uploading or
creating SDRF files is not possible, but user-definable attributes
provide the corresponding functionality. Furthermore, it is possible
to compare several analyses, but not to the required extent. An
installation of the TAC software on the client is mandatory. The
significance values cannot be set individually for the single DEG
analyses. Also, performed analysis steps can only be traced by
manually repeating the steps.

GEO2R is a tool for gene expression analysis that is available
on the website of the GEO database [10]. According to our
requirements analysis we identified the following shortcomings of the
software: Local files and SDRF files cannot be uploaded. Further- 
more, only one biological group per sample can be specified, which 
limits the possible analyses. GEO2R does not offer the ability to 
generate PCA plots and heat maps or filter the samples after qual- 
ity control. Multiple DEG analyses can only be defined indirectly 
because all defined biological groups are automatically compared. 
For these, only similar significance thresholds can be defined. A 
comparison of several analyses is only offered for analyses made 
between samples of the same series and only via one Venn diagram. 

NetworkAnalyst [25] is a free web-based tool for gene expression 
analysis that offers DEG analyses and network analyses. According 
to our requirements analysis we identified the following shortcom- 
ings of the software: Files in CEL format, as well as SDRF files, 
cannot be uploaded. However, the functionality of defining biolog- 
ical groups can be achieved by specifying the factors elsewhere. 
For preprocessing, the tool only applies the filtering of genes and 
the normalization step. Background correction, as well as summa- 
rization, does not seem to be performed. Moreover, the filtering of 
samples after quality control is not possible. Multiple DEG analyses 
can be defined, but not as precisely as required and individual sig- 
nificance thresholds can only be defined for the adjusted p-value. 
Several analyses can be compared, but not to the requested level. 

Overall, no tool completely meets the requirements graded “high”. 
Especially the comparison of analyses is only possible to a limited 
extent. Developing our own application for gene expression
analysis within our institute offers further advantages. Data that
has not yet been published can be analyzed without any concerns 
about data protection. In contrast, the security of third-party tools, 
especially if they are publicly available on the internet, has to be 
verified first. Moreover, if new requirements arise, the software can 
be flexibly extended. 


5 SHINY MODULARIZATION 


The programming language R provides an environment for the 
analysis and graphical visualization of data [22]. R was chosen 
for our implementation because the open-source software project 
Bioconductor provides many R packages for the analysis of gene 
expression data. Although these could also be used in other pro- 
gramming languages, the application’s complexity can be reduced 
by limiting it to one programming language. This has a positive 
impact on the maintainability of the software. With R Shiny, a 
framework for the development of web-based applications directly 
in R is provided. An application is built from two main components: 
A UI object and a corresponding server function, which accesses 
the UI elements via their defined IDs. Also, the server function 
defines the logic for the functionality of the web app, for example, 
processing the data or plot variables. The elements defined in the 
UI object are either input or output elements. Values of the input 
elements are processed in the server function code, while the output
elements visualize the results of those calculations. Moreover, Shiny 
uses a reactive programming model. Reactivity in Shiny means that 
when the input values are changed, the code sections that access 
the changed input elements are automatically re-executed. Thus, 
the corresponding output elements are also automatically updated. 
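
The idea of reactivity can be conveyed with a small conceptual sketch. It is written in Python rather than R and is an analogue only: Shiny tracks the dependencies between inputs and outputs automatically, whereas in this sketch the dependent code is registered explicitly. The class and all names are hypothetical and do not correspond to Shiny's API.

```python
class ReactiveValue:
    """Holds a value and re-runs registered observers when it changes."""

    def __init__(self, value):
        self._value = value
        self._observers = []

    def get(self):
        return self._value

    def set(self, value):
        # Only an actual change triggers re-execution of dependent code.
        if value != self._value:
            self._value = value
            for observer in list(self._observers):
                observer()

    def observe(self, callback):
        """Register dependent code and run it once to produce initial output."""
        self._observers.append(callback)
        callback()


threshold = ReactiveValue(0.05)  # plays the role of an input element
output = {}                      # plays the role of the output elements
render_calls = []


def render_summary():
    render_calls.append(threshold.get())
    output["summary"] = f"threshold = {threshold.get()}"


threshold.observe(render_summary)
threshold.set(0.01)  # the output is updated without an explicit call
threshold.set(0.01)  # unchanged value: nothing is re-executed
print(output["summary"], len(render_calls))
```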

As mentioned in Section 4, for the implementation of our web
service, we first specified software requirements in close
communication with the end users. During the complete development
process, new requirements were identified, described, and evaluated.
The process was thus iterative, and a first prototype of our
application resulted in monolithic source code.

Defining modules should improve the handling of complex Shiny 
applications. For this purpose, functions are used, which are the 
fundamental unit for abstraction in R. Modules consist of functions 
that generate UI elements and functions used in the module’s server 
function. There is a global namespace within a Shiny application, 
so each ID of input or output elements must be unique within the 
app. When plain functions are used to generate UI elements, the
uniqueness of the IDs must be ensured by the developer. Shiny
modules solve this problem by creating an abstraction level beyond
functions.

Shiny modules have three basic characteristics: 


(1) They cannot be executed alone but are always part of a larger 
application or module. The nesting of modules is therefore 
possible. 

(2) They can represent input, output or both. 

(3) They are reusable: both in several applications and several 
times in the same application. 


For creating a module, one function is needed that defines UI ele- 
ments and one function that uses these elements and contains the 
server logic. The UI function expects an ID as a parameter, which
defines the namespace of the module and is chosen by the caller of
the function. The namespace is applied to all IDs of the input and
output elements to ensure the correct mapping of the elements
to the namespace. The same ID must also be passed to the
server function of the module. Within the server function, the
moduleServer function is invoked, where the actual server logic is
defined. The objects input and output are aware of the namespace, so
the elements whose IDs are wrapped in the defined namespace can be
referenced in the standard way via input$elementID. UI elements
outside the namespace cannot be addressed at all. In the regular
components of the Shiny
application, the UI function of the module is called in the UI object 
and the server function of the module in the regular server function. 
It is essential to pass the same ID to the corresponding functions. 
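
The namespacing mechanism described above can be sketched as follows. The code is a conceptual Python analogue, not Shiny code: the helper names, the module "threshold" and the use of "-" as separator between namespace and element ID are assumptions made for illustration.

```python
def make_ns(namespace):
    """Return a function that prefixes element IDs with the namespace."""
    def ns(element_id):
        return f"{namespace}-{element_id}"
    return ns


def threshold_ui(module_id):
    """UI function of a hypothetical module: all IDs are namespaced."""
    ns = make_ns(module_id)
    return {"inputs": [ns("threshold")], "outputs": [ns("summary")]}


class ScopedInput:
    """View on the global input store that only resolves namespaced IDs."""

    def __init__(self, store, module_id):
        self._store = store
        self._ns = make_ns(module_id)

    def __getitem__(self, element_id):
        return self._store[self._ns(element_id)]


def threshold_server(module_id, store):
    """Server function of the module: uses plain IDs inside its namespace."""
    scoped = ScopedInput(store, module_id)
    return f"threshold = {scoped['threshold']}"


# The same module is used twice in one application with different IDs.
ui_first = threshold_ui("deg1")
ui_second = threshold_ui("deg2")
store = {"deg1-threshold": 0.05, "deg2-threshold": 0.01}

print(ui_first, ui_second)
print(threshold_server("deg1", store), threshold_server("deg2", store))
```

Because the UI and the server function derive all IDs from the same module ID, the two instances do not collide in the global namespace, and an element outside the namespace cannot be looked up through the scoped view.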

In order to improve maintainability, we refactored the monolithic 
version of our app into a modularized one. Figure 2 illustrates an 
overview of the defined main modules and their sub modules of 
the web application. One module was defined for each main step of 
the workflow, resulting in the following main modules: dataUpload, 
preprocessing, degAnalysis, degComparison, and goAnalysis. Each
of these main modules is realized within a separate tab/panel in
the web application. Each of the main modules accesses at least one 
sub-module with more specific responsibilities. 


[Figure 2 diagram: the main modules exchange the values rawData,
factornames, metadata, finishedDataUpload, prepData,
finishedPreprocessing, degResults, finishedDegAnalysis, goResults
and finishedGoAnalysis.]


Figure 2: Modularized application


[Table 1: rows are the design properties encapsulation, coupling,
cohesion, size and complexity; columns are maintainability and its
sub characteristics; the cells contain the arrows described in the
caption.]


Table 1: Influence of OO design principles on the sub characteristics
of the software quality property maintainability; an arrow pointing
up (↑) indicates a positive influence of the design property on the
quality (sub) characteristic and an arrow pointing down (↓) indicates
a negative influence.


6 SOFTWARE DESIGN PRINCIPLES 


This article focuses on the evaluation of the maintainability of
Shiny applications, specifically of the Shiny components that build
and control the user interface (UI). As already specified,
maintainability consists of the sub characteristics
modularity, reusability, analyzability, modifiability, and testability. 
The latter is not considered in this evaluation of Shiny, as this is 
a very extensive topic and would exceed the scope of this paper. 
In the following, a relationship between these sub-characteristics 
and design principles from the object-oriented (OO) programming 
paradigm is first established. In this section, the term application 
refers to a Shiny web app and the term module refers to a Shiny 
module. 


6.1 Influence of OO design principles on 
maintainability 


When designing OO software products, several design principles are 
defined, supporting the enhancement of the software quality. In the 
following, a selection of the design principles defined by Schatten 
et al. [21] is presented and assumptions about their influence on the 
sub characteristics of maintainability are made. Only the principles 
relevant to the metrics calculation in Section 7 are considered. The 
assumptions about the impact on maintainability are based on the 
results of a literature review and the comparison of the definitions of 
the design principles to those of the maintainability characteristics. 

Table 1 gives an overview, by means of upward and downward arrows,
of our assumptions about the influence of the design principles
on the software quality; that is, ideally higher encapsulation, looser
coupling, higher cohesion, and less extensive and less complex
software. We discuss our assumptions on software design principles in
detail:


Encapsulation. With encapsulation, details of the implementa- 
tion should be hidden from the outside, realized in the OO paradigm 
by private fields and methods. Communication between compo- 
nents should occur only via defined interfaces, which are public 
fields and methods. For example, this makes it possible to change
the encapsulated functionality of a component without affecting the
calling components, as long as the interface remains unchanged.
The similarity to the definition of the sub characteristic
modularity of maintainability is noticeable [16]. Therefore, the
assumption is made here that higher encapsulation also increases
modularity, reusability, modifiability, and maintainability overall.


Coupling. Coupling describes the dependency of two compo- 
nents, whereby in general, the lowest possible coupling should be 
achieved. Thus, the effects of changes of a component are reduced 
in the dependent components [21]. Also, tight coupling compli- 
cates the analyzability of the software. Therefore, with increasing 
coupling, the maintainability of the software decreases. 


Cohesion. Cohesion describes the degree to which the elements 
of a component are interrelated. As cohesion increases, the cou- 
pling between the individual components usually decreases [21]. 
Therefore, rising cohesion generally supports the maintainability 
of the product. 


Size and Complexity. In the following, metrics assigned to the
design properties size and complexity are also considered. The
assumption is made that with increasing size, especially the
software’s analysability and modifiability deteriorate, while
increasing complexity also impairs reusability.
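
The notion of coupling can be made concrete with a small Python sketch. It counts, for each component, the number of distinct other components it depends on, and lists the components that may be affected when one component changes. The module names are those of the application, but the dependency edges are hypothetical and serve only as an illustration; this simple count is also not necessarily one of the metrics calculated later.

```python
def coupling(dependencies):
    """Count the distinct other components each component depends on."""
    return {
        name: len(set(deps) - {name})
        for name, deps in dependencies.items()
    }


def dependents(dependencies, changed):
    """List the components that depend on `changed`."""
    return sorted(
        name
        for name, deps in dependencies.items()
        if changed in deps and name != changed
    )


# Hypothetical dependency map; the edges are illustrative only.
DEPENDENCIES = {
    "dataUpload": [],
    "preprocessing": ["dataUpload"],
    "degAnalysis": ["dataUpload", "preprocessing"],
    "degComparison": ["degAnalysis"],
    "goAnalysis": ["degAnalysis", "preprocessing"],
}

coupling_per_module = coupling(DEPENDENCIES)
affected = dependents(DEPENDENCIES, "preprocessing")
print(coupling_per_module, affected)
```

The fewer dependents a component has, the fewer components need to be analyzed and possibly adapted after a change, which is the assumed link between low coupling and maintainability.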


6.2 Transfer of OO concepts to Shiny 


Our aim is to analyze our Shiny app according to widely used OO 
software quality metrics [18]. Because in general the application
modules in Shiny do not fully follow the OO paradigm, some
assumptions have to be made to calculate the metrics. For this 
purpose, a definition of the fields and methods of different encapsu- 
lation levels is made for the Shiny framework. Only the elements 
of a class that are relevant in the metric selection of Terragni et al. 
[23] are considered. Hence, for transferring the class elements of 
the OO paradigm to the Shiny framework in a meaningful way, the 
class elements, their functionality and possible encapsulation levels 
are presented next. 

An object or instance has a defined state (defined by attributes 
or fields) and a defined behavior (defined by methods). In general, 
the behavior depends on the state, which means that the meth- 
ods access the attributes. A class defines for a collection of objects 
the structure, the behavior and the relationship to other classes. 
Static elements of a class are the same for all instances of that 
class. Static methods can only be applied to the whole class and not 
to a single instance. Usually, they are used to assign a new value 
to class attributes without the influence of an instance or whenever
an operation applies to all or several instances.


OO paradigm          Shiny framework
-------------------  ---------------------------------------------
Fields
  Static fields      Fields with equal value for all instances;
                     lockBinding in the environment
  Private fields     Reactive Values (including Input object)
                     defined inside application or module
  Protected fields   -
  Public fields      Fields declared in the R script outside of
                     functions; not locked in the environment
Methods
  Static methods     -
  Private methods    Functions defined inside server function
                     Reactive expressions
                     Observers (including render functions)
  Protected methods  -
  Public methods     Functions defined in the R script outside
                     of the server function
                     If application viewed: only server function
                     If module viewed: UI and server function


Table 2: Mapping of OO class components to Shiny applications
and modules


The following
three levels of encapsulation are differentiated: private elements
are accessible only within the class itself; protected elements are 
accessible in the class itself, in its subclasses and, depending on the 
programming language, in classes within the same package; public 
elements are accessible by all classes. The following statements 
can be made by transferring these definitions to the Shiny frame- 
work. Instances of a Shiny application or module are distinguished 
from each other by using different namespaces. An instance of an 
application or module also consists of a state and behavior. The 
state is primarily defined by input elements used by the defined 
behavior to calculate the output elements. The definition of the 
application or module determines the state and the behavior for a 
collection of instances. Static fields should have the same value for 
all instances of an application or module. Static functions should 
be applicable to all instances. The latter is not implementable for 
Shiny applications and modules because the principle of object 
management is not realized by default. However, the object man- 
agement could be achieved using static attributes, but this was not 
implemented in the exemplary software, so that static methods in 
the Shiny framework will not be considered in further discussion. 
The distinction of instances based on the namespace allows two 
stages of encapsulation to be defined: Private elements are only 
accessible within their own namespace, and public elements are 
also accessible outside the namespace. The encapsulation level pro- 
tected of the OO paradigm has no correspondence in Shiny since 
inheritance is not implementable in this framework. 

We now discuss how the individual elements from the OO paradigm
are mapped to elements of applications and modules of the
Shiny framework based on these assumptions; an overview of this
mapping is provided in Table 2.

It is assumed that private fields correspond to reactive values 
defined within the application or module in the Shiny framework. 
Reactive values are used in the defined behavior of the application 
or module to generate the output. Moreover, they are not accessible
outside the namespace of the application or module. Fields that are 
defined outside of a function in the script of an application or mod- 
ule and are not locked in the global environment are seen as public 
fields. These elements can also be addressed outside the namespace 
itself. They might also be assumed as static, but any instance can 
change the value of the field. This may not affect all other instances 
of this application or module that already exist, especially in other 
sessions. Therefore, only fields that have the same value for all
instances and are locked in the environment via the lockBinding
function are assumed to be static fields. A field has the same value for all
instances if the value does not depend on any reactive values or 
method parameters of the functions. Locking the field prevents the 
assignment of a new value. Thus, the definition of static variables 
in the Shiny framework is stricter than in the OO paradigm, but 
otherwise, the equality of the field’s value for all instances cannot 
be ensured. Private functions in the Shiny framework correspond 
to functions defined within the server function of an application 
or module, reactive expressions and observers, including render 
functions. These functions define the behavior of the application 
or the module by calculating the output in dependency of the state 
of an instance. Functions defined in the server function are not 
accessible outside of the namespace of the application or module. 
Observer and render functions can only be defined inside other 
functions, so they are automatically not callable from the outside. 
Also, they are not explicitly invoked in the program code anyway
but are executed based on reactivity. Finally, public functions are 
functions defined in the R script outside of any other functions. For 
Shiny applications, this does not include the server function since 
this cannot be called outside the application in any meaningful 
way, except to create a new instance. For Shiny modules, the UI 
and server functions are also included as they provide the module’s 
interface and are thus called by other modules or applications. 
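
The mapping rules described above can be summarized as a small rule set. The following Python sketch is one possible reading of these rules; the parameter names, the kind labels and the return value "unmapped" for cases the text does not assign are hypothetical.

```python
def classify_field(reactive, inside_app_or_module, locked, same_for_all_instances):
    """Map a field of a Shiny application or module to an OO field category."""
    if reactive and inside_app_or_module:
        return "private"
    if not inside_app_or_module and locked and same_for_all_instances:
        return "static"
    if not inside_app_or_module and not locked:
        return "public"
    return "unmapped"


def classify_function(kind, inside_server):
    """Map a function to an OO method category.

    kind is one of: "function", "reactive_expression", "observer",
    "render", "module_ui", "module_server", "app_server".
    """
    if kind in ("reactive_expression", "observer", "render"):
        return "private"
    if kind in ("module_ui", "module_server"):
        return "public"      # they form the interface of a module
    if kind == "app_server":
        return "unmapped"    # not callable from outside in a meaningful way
    return "private" if inside_server else "public"


examples = [
    classify_field(True, True, False, False),
    classify_field(False, False, True, True),
    classify_field(False, False, False, True),
    classify_function("render", True),
    classify_function("function", False),
    classify_function("module_server", False),
    classify_function("app_server", False),
]
print(examples)
```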


7 SOFTWARE QUALITY ANALYSIS 


This section examines how the software quality characteristic
maintainability, which was defined in Section 3, is affected by the
modularization of Shiny applications by comparing a modularized and
a non-modularized version of our application. By calculating
software quality metrics, the effect of the modularization of Shiny
apps on maintainability is considered quantitatively.
These metrics are mostly derived from the OO paradigm, so they 
must be partially adapted to the Shiny paradigms. As such, a formal 
evaluation of the use of the Shiny framework for the development 
of larger, more complexly structured applications is made. We rely 
here only on very specific metrics for a static code analysis - which 
is an established method [9, 11, 23] - due to the fact that we com- 
pare two versions of our application with a fixed feature scope. An 
advanced assessment could also use a cost model based on change 
rate of the code; in [2] the authors argue that this cost model enables 
the prediction of future development costs. 


7.1 Metrics calculation


Based on the correspondences between Shiny and the OO para- 
digm just discussed, the effect of modularizing Shiny applications 
(see Section 5) is examined. For this purpose, the presented
application for gene expression analysis was implemented, starting
from an earlier development state, both with modularization and
without (monolithic). Any comments or other code documentation were
removed in the source codes of both applications to increase their 
comparability; this is based on the assumption that differences in 
comments do not influence maintainability. Since both applications 
represent the same functionality, the actual calculation logic was 
abstracted into functions called by both applications. In the fol- 
lowing, these methods are referred to as functionality methods 
or functions. Thus, the source code to be compared contains as 
far as possible only instructions concerning the UI as well as its 
control. For the quantitative comparison, metrics were calculated, 
which were mostly originally developed to evaluate classes of OO 
programming languages [23]. First, it will be discussed which of the 
metrics will not be calculated. The number of bytecode instructions 
(NBI) was used in [23] because applications developed in Java were 
considered. Compiling R program code into bytecode is possible, 
but an interpreter is used by default. Therefore, this metric is not 
calculated. We compare both our applications (monolithic versus 
modularized) without considering the code documentation, thus it 
would not be useful to calculate the lines of comment (LOCCOM) 
metric. The number of static methods (NSTAM) cannot be calcu- 
lated since it is not possible or reasonable to declare a method of an 
application or a module as static in Shiny. Moreover, the concept of 
inheritance is not implementable in Shiny applications or modules, 
so any metrics that are based on inheritance cannot be calculated. 
This includes the depth of inheritance tree (DIT), the number of 
children (NOC), the measure of functional abstraction (MFA), the 
inheritance coupling (IC) and the coupling between methods (CBM). 
There is no equivalent in the Shiny framework to the encapsula- 
tion level protected used in the OO paradigm, as explained earlier. 
Therefore, the number of protected methods (NPROM) cannot be 
calculated. 

Next, the metrics that were calculated for the sample applications 
are presented. The details of how they were calculated will also be 
discussed. Table 3 provides an overview of the calculated metrics 
and a brief description. 

The lines of code (LOC) metric counts the number of non-blank lines of
the application or the module. The IntelliJ plugin MetricsReloaded was
used to calculate the LOC. This metric is used to evaluate the design
property size, for which a low value is desirable. This applies to all
metrics of this design property. The number of public methods (NPM), the
number of fields (NOF) and the number of static fields (NSTAF) were
calculated by counting the corresponding elements in the application or
the module. Terragni et al. do not define which encapsulation level
should be considered for the calculation of the NOF. It is assumed that
only public fields are counted since a separate metric exists for
private fields. The number of method calls (NMC) is the sum of the
number of method calls internal (NMCI) and the number of method calls
external (NMCE). The NMCI was calculated by counting all invocations of
functions defined in the application or module itself. The NMCE is the
number of invocations of functions defined in another module or an
external R package. Calls to functionality methods are also included in
this metric.
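
The split of NMC into NMCI and NMCE can be illustrated with a minimal
sketch. It assumes that the invocations of a component have already been
extracted as a list of function names; the names used below
(`count_method_calls`, `build_table`, and the example call list) are
hypothetical and not taken from the analyzed application.

```python
def count_method_calls(calls, own_functions):
    """Split the invocations of one component into internal and external calls.

    calls: list of called function names, one entry per invocation.
    own_functions: set of function names defined in the component itself.
    """
    # NMCI: invocations of functions defined in the component itself
    nmci = sum(1 for name in calls if name in own_functions)
    # NMCE: invocations of functions from other modules or packages
    nmce = len(calls) - nmci
    return {"NMCI": nmci, "NMCE": nmce, "NMC": nmci + nmce}


# Hypothetical invocation list of a single module
calls = ["renderPlot", "moduleServer", "build_table", "build_table", "reactive"]
own_functions = {"build_table"}
result = count_method_calls(calls, own_functions)
```

Every invocation is counted here, including repeated calls to the same
function; this is what distinguishes NMC from RFC, which counts distinct
functions only.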
Design Property | Name                                      | Description
----------------|-------------------------------------------|------------------------------------------
Size            | Lines of Code (LOC)                       | Number of non-blank lines
Size            | Number of Public Methods (NPM)            | Number of public functions in an application or module
Size            | Number of Fields (NOF)                    | Number of public fields in an application or module
Size            | Number of Static Fields (NSTAF)           | Number of static fields in an application or module
Size            | Number of Method Calls (NMC)              | Number of function invocations
Size            | Number of Method Calls Internal (NMCI)    | Number of invocations of functions defined in the application or module
Size            | Number of Method Calls External (NMCE)    | Number of invocations of functions defined in other modules or packages
Complexity      | Weighted Methods per Class (WMC)          | Sum of the Cyclomatic Complexity of all functions in the application or module
Complexity      | Average Method Complexity (AMC)           | Average of the Cyclomatic Complexity of all functions in the application or module
Complexity      | Response For a Class (RFC)                | Number of functions that respond to a message from the application or module itself
Coupling        | Coupling Between Object classes (CBO)     | Number of other modules or packages that an application or module is coupled to
Coupling        | Afferent Coupling (Ca)                    | Measure of how many other applications or modules use the specific application or module
Coupling        | Efferent Coupling (Ce)                    | Measure of how many other modules or packages are used by the specific application or module
Cohesion        | Lack of Cohesion in Methods (LCOM)        | Difference between the number of function pairs without and with common non-static fields
Cohesion        | Lack of Cohesion Of Methods (LCOM3)       | Revised version of LCOM
Cohesion        | Cohesion Among Methods in class (CAM)     | Represents the relatedness among functions of an application or module
Encapsulation   | Data Access Metrics (DAM)                 | Ratio of the number of private fields to the total number of fields
Encapsulation   | Number of Private Fields (NPRIF)          | Number of private fields of an application or module
Encapsulation   | Number of Private Methods (NPRIM)         | Number of private functions of an application or module

Table 3: Descriptions of the calculated (class) metrics [23]

Next, the metrics for the quantitative assessment of the design property
complexity are presented. For all of these metrics, a low value
corresponds to low complexity. The weighted methods per class (WMC)
metric is the sum of the Cyclomatic Complexity [19] of all functions
defined in the application or module; the Cyclomatic Complexity counts
the linearly independent paths within a program. The R package
cyclocomp was used to calculate the Cyclomatic Complexity of a single
function. The average method complexity (AMC) is the average of the
Cyclomatic Complexity of all functions in the application or module.
The AMC of an application or module m is therefore calculated by the
formula $\mathrm{AMC}_m = \mathrm{WMC}_m / \mathrm{NPM}_m$.
The response for a class (RFC) is the number of functions that respond
to a message from the application or module itself. It is calculated by
counting the number of distinct functions that are invoked by the
application or module, regardless of whether the function was defined
inside the application or module itself. Functions called multiple
times are counted only once [8].
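
As an illustration of the three complexity metrics, the following
minimal sketch derives WMC, AMC and RFC for one component. It assumes
that the Cyclomatic Complexity of each function has already been
measured (for example with cyclocomp); all function names and numbers
are hypothetical.

```python
def complexity_metrics(cyclomatic, npm, invoked):
    """Compute WMC, AMC and RFC for one application or module.

    cyclomatic: dict mapping function name -> Cyclomatic Complexity.
    npm: number of public methods of the component (must be > 0).
    invoked: list of function names called by the component.
    """
    wmc = sum(cyclomatic.values())   # sum over all defined functions
    amc = wmc / npm                  # AMC_m = WMC_m / NPM_m, as in the text
    rfc = len(set(invoked))          # distinct functions, each counted once
    return wmc, amc, rfc


# Hypothetical module with two public functions and one private helper
cyclomatic = {"mod_ui": 1, "mod_server": 4, "helper": 3}
invoked = ["helper", "reactive", "helper", "renderPlot"]
wmc, amc, rfc = complexity_metrics(cyclomatic, npm=2, invoked=invoked)
```

Because every function contributes at least 1 to WMC, a component with
more functions has a higher WMC even if each function is simple, which
is why AMC is the more informative value when components of different
sizes are compared.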

The metrics for measuring the property coupling are presented next. Low
values indicate loose coupling, while high values indicate tight
coupling. The coupling between object classes (CBO) is calculated by
counting the distinct afferent and efferent modules or packages of an
application or module [8]. Since the example applications do not have
circular dependencies, this is the same as the sum of the afferent
coupling (Ca) and the efferent coupling (Ce). Ca is calculated by
counting the number of modules or applications that call the target
application or the functions of the target module. Ce, on the other
hand, is calculated by counting the modules or packages from which the
target application or module calls functions. For example, if the
application A invokes a function of module B, the Ca of B, as well as
the Ce of A, increases by one. In the considered example, no packages
are included in the calculation of Ca since packages usually do not call
any functions of a Shiny application or module. In the calculation of
Ce, applications were not considered because no application or module
calls another application.
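
The coupling metrics can be sketched from a list of dependencies between
components. The following example is a minimal illustration under the
assumption that the dependencies are available as (caller, callee)
pairs; the component names are hypothetical.

```python
def coupling(dependencies, target):
    """Compute Ca, Ce and CBO of one component.

    dependencies: set of (caller, callee) pairs between applications,
                  modules and packages.
    target: name of the component to evaluate.
    """
    # Components that call the target (afferent)
    afferent = {src for src, dst in dependencies if dst == target and src != target}
    # Components that the target calls (efferent)
    efferent = {dst for src, dst in dependencies if src == target and dst != target}
    return {
        "Ca": len(afferent),
        "Ce": len(efferent),
        # Distinct coupled components; equals Ca + Ce without circular dependencies
        "CBO": len(afferent | efferent),
    }


# Hypothetical dependency structure of a small modularized application
dependencies = {
    ("app", "mod_upload"),
    ("app", "mod_plot"),
    ("app", "shiny"),
    ("mod_plot", "ggplot2"),
    ("mod_plot", "shiny"),
}
plot_coupling = coupling(dependencies, "mod_plot")
app_coupling = coupling(dependencies, "app")
```

In this sketch the application has a Ca of 0 because nothing calls it,
which mirrors the observation that an application cannot be called from
outside.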

Now, the metrics for evaluating the design property cohesion are
discussed. The possible and desired values differ between the metrics.
The lack of cohesion in methods (LCOM) corresponds to the difference
between the number of function pairs without and with common non-static
fields. The LCOM of an application or module m is calculated by
$\mathrm{LCOM}_m = b - c$, where b equals the number of function pairs
that do not reference any common non-static field and c equals the
number of function pairs that reference at least one common non-static
field [8]. Fields that are referenced indirectly by the function of
interest via invoked functions are also counted. If the result is
negative, LCOM is set to 0. This metric's value should be as low as
possible, whereby a value of zero indicates a cohesive application or
module. However, the interpretation of a value depends on the total
number of functions defined in a component, which makes values hard to
compare between components. The metric LCOM3 was developed to address
this problem. The LCOM3 of an application or module m is calculated by
$\mathrm{LCOM3}_m = \frac{x - f \cdot a}{a - f \cdot a}$, where f equals
$\mathrm{NPM}_m$, a equals the sum of $\mathrm{NOF}_m$ and
$\mathrm{NPRIF}_m$, and x equals the sum of the number of referenced
non-static fields of all functions. If an application or module defines
only one function or has no non-static fields, LCOM3 cannot be
calculated and is set to 0. The value of LCOM3 is always between 0 and
2, where 0 indicates high cohesion. Values above 1 are critical since
they show the existence of values that are not accessed by any function
within this application or module (Henderson-Sellers, 1996, p. 147).

The metric cohesion among methods in class (CAM) of an application or
module m is calculated by $\mathrm{CAM}_m = \frac{p}{y \cdot f}$, where
p equals the sum of the number of different types of method parameters
of each function defined in the application or module, f equals the sum
of $\mathrm{NPM}_m$ and $\mathrm{NPRIM}_m$, and y equals the number of
distinct types of method parameters of all functions in the application
or module. The type of a parameter is the class the parameter should
inherit from, for example, a list, an integer or a data frame. Local
variables defined in the surrounding function of a private function are
also considered to be parameters. CAM ranges between 0, which indicates
low cohesion, and 1, which indicates high cohesion [3].

Finally, the calculated metrics for the assessment of the design
property encapsulation are presented. The number of private fields
(NPRIF) and the number of private methods (NPRIM) are calculated by
counting the corresponding elements in the application or module. The
data access metrics (DAM) correspond to the ratio of NPRIF to the total
number of fields defined in an application or module.
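
The cohesion and encapsulation formulas can be made concrete with a
short sketch. It is an illustration under stated assumptions, not the
procedure used for the reported values: the field references and
parameter types per function are assumed to be given as sets, all names
are hypothetical, and CAM is computed with the product of y and f in the
denominator, which is the form that keeps the value within the stated
range of 0 to 1.

```python
from itertools import combinations


def lcom(field_refs):
    """LCOM = b - c, floored at 0. field_refs: function -> set of fields."""
    b = c = 0
    for f1, f2 in combinations(field_refs, 2):
        if field_refs[f1] & field_refs[f2]:
            c += 1   # pair shares at least one non-static field
        else:
            b += 1   # pair shares no non-static field
    return max(b - c, 0)


def lcom3(field_refs, n_functions, n_fields):
    """LCOM3 = (x - f*a) / (a - f*a); set to 0 when it is undefined."""
    f, a = n_functions, n_fields
    if f <= 1 or a == 0:
        return 0.0
    x = sum(len(refs) for refs in field_refs.values())
    return (x - f * a) / (a - f * a)


def cam(param_types):
    """CAM = p / (y * f). param_types: function -> set of parameter types."""
    f = len(param_types)
    all_types = set().union(*param_types.values()) if param_types else set()
    y = len(all_types)
    if f == 0 or y == 0:
        return 0.0
    p = sum(len(types) for types in param_types.values())
    return p / (y * f)


def dam(nprif, nof):
    """DAM = private fields / all fields (private plus public)."""
    total = nprif + nof
    return nprif / total if total else 0.0


# Hypothetical module with three functions and three non-static fields
field_refs = {"ui": {"data"}, "server": {"data", "filter"}, "export": {"path"}}
param_types = {"ui": {"character"}, "server": {"character", "list"},
               "export": {"data.frame"}}
lcom_value = lcom(field_refs)
lcom3_value = lcom3(field_refs, n_functions=3, n_fields=3)
cam_value = cam(param_types)
dam_value = dam(nprif=3, nof=1)
```

In the example, one of three function pairs shares a field, so LCOM is
2 - 1 = 1, while LCOM3 relates the four field references to the nine
possible ones and therefore stays comparable across components of
different size.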


7.2 Results of metrics and evaluation 


The metrics explained above were calculated for the modularized and the
non-modularized software. For the non-modularized software, the metrics
were calculated collectively for the scripts run.R, ui.R, and server.R.
For the modularized software, the metrics of these three scripts were
also calculated as a group, referred to as the application. For the
modules of the modularized software, the metrics were calculated
individually in each case. The sum and the average of the values of the
individual modules were calculated for each metric to improve
comparability. For the design properties coupling and cohesion, the sum
over the modules is not meaningful because these metrics measure the
dependency on a component or within the component itself.
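
The aggregation over modules can be sketched as follows. This is a
minimal illustration with hypothetical module names and values; it
assumes that per-module results are available as dictionaries and that
the sum is simply omitted for coupling and cohesion metrics.

```python
# Metrics for which a sum over modules is not meaningful (coupling, cohesion)
NO_SUM = {"CBO", "Ca", "Ce", "LCOM", "LCOM3", "CAM"}


def aggregate(per_module):
    """Aggregate per-module metric values into sum and average.

    per_module: dict mapping module name -> dict of metric name -> value.
    Returns a dict mapping metric name -> {"sum": ..., "avg": ...}.
    """
    metric_names = next(iter(per_module.values())).keys()
    summary = {}
    for metric in metric_names:
        values = [metrics[metric] for metrics in per_module.values()]
        summary[metric] = {
            "sum": None if metric in NO_SUM else sum(values),
            "avg": sum(values) / len(values),
        }
    return summary


# Hypothetical results for two modules
per_module = {
    "mod_a": {"LOC": 40, "Ce": 3},
    "mod_b": {"LOC": 60, "Ce": 5},
}
summary = aggregate(per_module)
```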

The differences in results between the modularized and the
non-modularized software are now considered for each design property,
and their relationship to the sub-characteristics of maintainability is
addressed again. An overview of the results is shown in Table 4. For the
modularized software, in addition to the sum and the average, the
results of only the application (app) itself are also shown.

The sums of the metrics associated with the design property size are
mostly larger for the modularized software than for the non-modularized
software. Exceptions are NOF, NMCI and NSTAF, for which there is no
difference between the two versions. The averages over the individual
modules, in contrast, are all smaller than the values of the monolithic
application. The larger sums of the modularized application can be
explained by the fact that at least two public methods are defined as
the interface of almost every module. This increases the NPM and the
NMCE, and therefore also the NMC and the LOC. With increasing software
size, analyzability and thus modifiability decrease. The metrics of the
design property size therefore indicate worse maintainability of the
modularized application as a whole in comparison with the monolithic
software. However, the individual components are substantially more
compact and individually easier to maintain.

[Table 4: metric values per design property (Size, Complexity, Coupling,
Cohesion, Encapsulation) for the modularized software (app, sum and
average over all modules) and for the non-modularized software; the cell
values did not survive extraction.]

Table 4: Results of the calculated metrics for the application
in the non-modularized and modularized design

The same statement can be made for the calculated metrics of complexity.
The sum over the individual modules indicates a higher complexity of the
entire modularized software than of the non-modularized software. On
average, however, the individual modules are again less complex than the
non-modularized software. The increased WMC could also be caused by the
increased number of functions, since each function has a Cyclomatic
Complexity of at least 1. Therefore, the AMC is more meaningful for
assessing complexity; it indicates that the functions of the modularized
software are marginally less complex. When calculating the sum of the
RFC, each called function was counted only once across all modules.
Again, the higher value can be explained by the higher number of
functions in the modularized software. With respect to complexity as
well, the modularized application as a whole therefore has worse
maintainability than the non-modularized software, while the single
modules are once again less complex and easier to maintain.

As explained above, only the average value of the modules is compared
with the monolithic software for the properties coupling and cohesion.
All computed values show looser coupling as well as higher cohesion for
the modularized software. The only exception is Ca, but this is because
the monolithic software, being an application, cannot be called from
outside. As discussed before, looser coupling improves all
sub-characteristics of maintainability. The increased cohesion further
supports loose coupling and therefore also has a positive effect on
maintainability.

Finally, the metrics used to evaluate the design property encapsulation
are considered. Encapsulation is unnecessary for monolithic
architectures since the application is not called from outside.
Nevertheless, the calculated metrics indicate good encapsulation of the
modules, in particular because DAM has a very high value. This improves
the sub-characteristics modularity, reusability and modifiability.

Altogether, based on the metrics, the modularized software as an entire
system is larger and more complex than the monolithic software. This
indicates worse analyzability, modifiability, reusability and thus
maintainability in total. However, the individual modules can also be
regarded separately from each other, in which case the computed metrics
indicate substantially higher maintainability.

In addition to the quantitative evaluation, the results are now
discussed qualitatively. Although modularization increases the overall
size and complexity of software, it also improves its maintainability.
However, this depends on how the modularization is implemented: an
appropriate degree of modularization must be chosen to obtain modules
that are cohesive on the one hand and loosely coupled on the other.
This allows a significant improvement of the reusability, analyzability
and therefore also modifiability of the software. A well-chosen
modularization also offers the advantage that the modules can be
considered separately from one another. The part of the software that
has to be considered during maintenance becomes much smaller and less
complex than with a monolithic application. The effect of changes within
a module on other modules can also be estimated and contained much
better. Modules can be called multiple times in one or more
applications, minimizing or even preventing redundant program code.
Altogether, from a qualitative point of view, modularization
substantially simplifies the maintenance of the software.


8 DEPLOYMENT 


In a modern DevOps context, pipelines usually concern the automation of
all steps from source code to the deployment of the finished product.
Pipelines are omnipresent in today's IT landscape. From data pipelines
to integration and testing pipelines up to deployment pipelines, the
term finds broad acceptance and has a wide definition. In our system
architecture, the steps considered are building, testing, packaging,
and deploying a piece of software. We followed this DevOps paradigm in
our software development process as follows.

The web service is executed in a Docker container on a server provided
by the Fraunhofer ITEM. This server has 48 cores and 256 GB of random
access memory (RAM) and thus provides sufficient computing power to
analyze the gene expression data. The program code is stored in a GitLab
repository. A Dockerfile defines the build and deployment of the service
and installs its prerequisites and dependencies. In turn, the building
and pushing of the image into the Docker registry is controlled via a
GitLab CI/CD file. This file enables continuous integration (CI) and
continuous deployment (CD), meaning that a defined pipeline is run and
the application is deployed to production whenever a modification is
made. The pipeline consists of a test stage for checking the Dockerfile
and a deploy stage for deploying the service to the server. The
container for deployment is created and maintained using Portainer, a
dashboard for the management of Docker containers.


9 CONCLUSION 


The Shiny applications found in the literature are often quite small
and used as dashboards for data visualization. To address the challenge
of developing large applications, we investigated whether this framework
is suitable for the development of complex applications. Our analysis
focuses on the formal evaluation of the maintainability of Shiny
applications. We compared a modularized and a non-modularized version of
our application and followed a quantitative approach based on selected
software quality metrics. Quantitatively, modularized applications are
larger and more complex than monolithic applications. However, the
modules can be considered independently, so that the resulting
maintainability of modularized applications is rated better overall.
Most software quality metrics found in the literature were developed to
evaluate object-oriented (OO) principles. Therefore, a subgoal of our
article was the investigation of the influence of OO principles on the
maintainability of software. To be able to transfer these principles to
Shiny, we examined which Shiny application elements are equivalent to
the elements of the OO paradigm needed for the computation of the
metrics. For this, the purpose of the OO elements was compared with
components of Shiny that have the same or a similar purpose. Based on
these assumptions, the software quality metrics were calculated. Our
application is located in the biomedical realm; hence the choice of
Shiny is rooted in the availability of libraries for the backend
pipeline in the R programming language. We believe that our discussion
generalizes to other web applications that have a workflow character and
rely on tabs in the web pages, for which Shiny components could be
reused. Nevertheless, other domains that rely less on specific R backend
libraries and that might have a focus on the microservice paradigm might
profit more from alternative web frameworks (for example, based on
JavaScript).

In future work, we aim to add capabilities to execute workflows 
on other input file formats. A major research question in this regard 
is whether the modularization of our software proves to support 
the modifiability aspect. 
ABSTRACT


Data lakes are supposed to enable analysts to perform more effi- 
cient and efficacious data analysis by crossing multiple existing 
data sources, processes and analyses. However, it is impossible to 
achieve that when a data lake does not have a metadata governance 
system that progressively capitalizes on all the performed analysis 
experiments. The objective of this paper is to have an easily acces- 
sible, reusable data lake that capitalizes on all user experiences. To 
meet this need, we propose an analysis-oriented metadata model 
for data lakes. This model includes the descriptive information of
datasets and their attributes, as well as all metadata related to the
machine learning analyses performed on these datasets. To illustrate
our metadata solution, we implemented a web application for data lake
metadata management. This application allows users to find and use
existing data, processes and analyses by searching relevant metadata
stored in a NoSQL data store within the data lake. To demonstrate how
easily metadata can be discovered with the application, we present two
use cases with real data: dataset similarity detection and machine
learning guidance.
1 INTRODUCTION


IoT data is increasingly integrated into the core of today's society.
To analyze a market or a product, decision makers must integrate
these different IoT data and also combine them with massive data
produced internally as well as with external data such as Open Data.
To obtain a complete vision, it is therefore necessary to integrate
both voluminous fast data and numerous small data.

Faced with the plurality of data types, the data lake (DL) is one of 
the most appropriate solutions. DL is a big data analytics solution 
that allows users to ingest data in its raw form from different sources 
and prepare it for different types of data analysis [11]. With a DL, 
all types of data can be stored for further analysis for different users. 
Even if, in principle, the DL seems to adapt to different needs, it is 
necessary to define an architecture that takes into account all the 
usual needs of big data analytics focused on large volume data or 
data with high velocity while integrating the constraints specific 
to IoT and small data. 

To meet the previously stated objectives and contrary to other
proposals [6], we propose a complete solution composed of a functional
architecture, a technical architecture of a Multi-Zone Data Lake (MZDL)
and experimental assessments. This data lake must (i) ingest different
data sources, (ii) implement pre-processing upstream of analytics,
(iii) provide access to and consumption of the pre-processed data and
finally (iv) properly govern the data throughout the previous steps to
avoid inaccessible, invisible, incomprehensible [1] or untrustworthy
data. To address the veracity problem, we have added an efficient
metadata management system that allows the data to be characterized,
together with security mechanisms and tools that allow greater
confidence in the data. This solution will be described from a logical
and a physical point of view.

The paper is organized as follows. In section 2, the state of the art
on IoT data analysis is presented, together with the limitations of
current work. Our proposal is explained in the following sections. In
section 3, we provide a functional architecture based on 4 zones. We
focus on a metadata model which is not a simple data catalog but which
saves all the information related to ingested data, transformation
processing and analyses. This metadata model facilitates the task of
data analysts by allowing them to easily search for ingested data,
transformation processing or analyses. In section 4, we present the
different components of our technical architecture and justify our
choices. A discussion about Hadoop is also presented. Our proposal is
evaluated in a specific context (the NeOCampus project). In section 5,
we explain the ingested data sources, the transformation processes we
implemented and the analyses carried out. Finally, we present the
metadata management system dedicated to data analysts.


2 RELATED WORK 


According to a recent survey on big data for IoT [4], most works deal
with only one application domain, for example healthcare, energy,
building automation or smart cities. For each domain, the strategy is
to collect data coming from several sensors related to the domain under
study. The IoT big data collected from sensors differ from other big
data as they are heterogeneous, varied and grow rapidly. Since the
definition of big data also includes the possibility of crossing several
types of data, handling multi-domain sensors seems to be much more
difficult than handling one-domain sensors.

Our second observation concerns the storage and analytics performed on
IoT data. Commonly, IoT data are stored in the cloud [13][4]. Concerning
the analytics, we identify the following categories according to [7]:
(i) real-time analytics, (ii) offline analytics, also named batch
analytics, (iii) memory-level analytics, (iv) BI analytics and
(v) massive analytics. The most popular types of analytics are real-time
and batch analytics. Batch analytics does not require a quick response;
Hadoop is the best-known framework used in this area, with its HDFS
storage system and the MapReduce parallel computing paradigm. For
real-time analytics, we distinguish two categories: strict and flexible.
Strict real time is necessary for critical applications and requires
specific data management architectures. Flexible real time (or near-real
time) allows freedom in terms of delays, and the maximum delays are
greater than for strict real time. Our study concerns this type of
application. The IoT data handled in this case are called stream data
[5]. They arrive continuously with a high velocity. These data are
eventually stored and analyzed in architectures such as Lambda
architectures or Kappa architectures.

From both observations concerning the general work on IoT big data,
the best solution seems to be to use the data lake paradigm in order to
capture multi-domain sensors and multiple types of analytics. The DL,
which first emerged only as a data repository [3], is today one of the
most popular big data analysis solutions. A DL ensures data ingestion,
raw and processed data storage, data preparation by different users and
data consumption for different types of analyses [11].

In the literature, we identify some papers dealing with the problem of
storing and analyzing IoT data from the data lake perspective. These
papers can be divided into (i) zone-less data lake architectures and
(ii) zone-based data lake architectures.

In the category 'Zone-less IoT data lake architectures', the authors
of [6] present a big data lake architecture for multilevel streaming
analytics based on Hadoop. It distinguishes relational and streaming
Twitter ingestion flows into a single Hadoop/HDFS storage space. The
zone-less architecture lacks a clear reuse strategy, in particular
because it has no metadata on the processes.

In the category 'Zone-based IoT data lake architectures', the authors
of [9] propose a big data lake platform for smart city applications.
The platform can be considered a zone-based architecture and strongly
relies on the Hadoop ecosystem, with data ingestion, data storage, data
exploration, analytics and data visualization zones. Two levels of data
ingestion are performed (manual and automatic). Stream data ingestion is
realized in near-real time with the Flume framework. Concerning the
metadata implementation, the authors use simple mechanisms that are
mostly related to the technical implementation. They also use simple
embedded metadata related to file names and paths, and metadata
description files. The authors of [10] propose a four-zone data lake
architecture for smart grid analytics on the cloud. They collect smart
meter, image and video data in the domain of smart grids. Their
architecture is based on the Lambda architecture with both near-real
time and batch analytics. This work presents technical implementations
and experiments, but the architecture does not enable reuse in the data
lake.

According to the challenges and issues related to big data in IoT [2],
data diversity in data storage and analysis is not fully addressed in
the works cited above. Moreover, knowledge discovery and computational
complexity are not covered in these works. A lack of metadata is
commonly observed in these works, although metadata are indispensable.
The last challenge concerns information security, which can be addressed
with specific services as well as metadata mechanisms.

Our contribution proposes a complete zone-based data lake architecture
that addresses the different challenges cited above and is applicable to
several fields. We use a metadata management system that covers the
different steps from ingestion through processing to analysis, in order
to improve the reuse of the data in the data lake.


3 A ZONE-BASED DATA LAKE: A
FUNCTIONAL ARCHITECTURE


In this section we describe our proposed zone-based data lake func- 
tional architecture and the metadata model dedicated to the archi- 
tecture. 


3.1 Components 


This data lake functional architecture facilitates data analytics by 
allowing users to find, access, interoperate and reuse existing data, 
data preparation processes and analyses. 

In order to meet near-real time and batch processing requirements, we
propose in Figure 1 our functional data lake architecture, which ingests
IoT data as well as small and big datasets. This architecture follows
the zone-based architecture proposed in [11]. The different zones are
described as follows:


- The raw ingestion zone allows users to ingest data and stores these
  data in their native format. Batch data are ingested and stored in
  this zone in their native format; near-real time data are stored
  after stream processing if users need to store them. Scalability is
  the main criterion for this zone, so that it can handle different
  volumes of data from different domains.

- The process zone allows users to prepare data according to their
  needs and stores all the intermediate data and processes. In this
  zone, value is added to the data by batch or stream processing. This
  zone concentrates all the processing and all the business knowledge
  applied to the raw data.

- The access zone allows the processed data to be accessed, analyzed,
  consumed and used in any specific business process. It must enable
  the consumption of data for visualization, real-time analytics,
  advanced machine learning or BI analytics, and more generally for
  decision support systems.

- The govern zone, applied to all the other zones, is in charge of
  ensuring data security, data quality, data life-cycle, data access
  and metadata management. This zone plays an important role in all
  stages of data sustainability and reuse. It is composed of two
  sub-zones: (i) metadata storage, which concerns the raw data, the
  processes and the analyses; the aim is to track the data life cycle,
  from ingestion to consumption, and thus to allow needs to be followed
  up and the data to be reused in a relevant way in the future;
  (ii) security mechanisms, which provide authentication, authorization
  and encryption as well as a multilevel security policy (for instance
  against external and internal threats). This sub-zone also supports
  quality of service by monitoring resource consumption at multiple
  levels.


3.2 A Metadata Model for the Data Lake 
Architecture 


It quickly became apparent that there was both a potential and a
need to use metadata for data durability and, more broadly, for the
enhancement of the data. We have therefore decided to create a zone
exclusively dedicated to metadata from a functional point of view.
This zone aims to be the place where all the metadata are kept, to
add real value to all the data and to allow these data to be reused
in contexts different from the original one. To achieve this goal,
it becomes necessary to follow the life cycle of the data. Designing
a tool to manage the metadata is therefore as necessary as managing
the data themselves.


3.2.1 Metadata for Data ingestion. Data ingestion, the first phase
of the data life-cycle in a data lake, consists in bringing external
datasets into the data lake. During this phase, metadata of all the
ingested datasets should be collected to ensure the findability,
reusability and security of datasets (see white classes in Fig. 2).
To do so, different types of metadata should be generated:


• Metadata of the ingestion process (class Ingest), which include
the ingestion program source code, execution details, links to
users, and the upstream and downstream datasets.

• Metadata of ingested datasets, which include basic
characteristics (class DatalakeDataset), schematic metadata
(classes DLStructuredDataset, DLSemiStructuredDataset,
DLUnstructuredDataset and their components) and semantic
metadata (class Tag, attribute DatalakeDataset.description).

• Metadata of data veracity (class VeracityIndex), a composite
index which combines different measures [12]: objectivity,
truthfulness and credibility.

• Metadata of data security (classes SensitivityMark,
SensitivityLevel) to protect sensitive or private data.


IDEAS 2021: the 25th anniversary 


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 


• Metadata of dataset relationships (classes RelationshipDS,
AnalysisDSRelationship), which can be predefined relationships
such as similarity, correlation, containment and logical
cluster, or user-defined relationships such as a common
subject.


Data ingestion metadata are only instantiated for batch ingestion. 
Real-time data are processed directly without passing through the 
ingestion phase. Note that when users want to back up real-time 
data, this activity is treated as batch ingestion with additional infor- 
mation SourceOfSteam to indicate the connected object/machine. 
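
To make the ingestion metadata above more concrete, the following is a minimal sketch of how a few of the classes could be represented in code. Only the class names Ingest, DatalakeDataset and Tag and the description attribute come from the model; every other attribute name, the storage URI and the helper find_by_tag are assumptions introduced for illustration, not the actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Tag:
    """Semantic metadata: a keyword attached to a dataset."""
    name: str


@dataclass
class DatalakeDataset:
    """Basic characteristics and semantic metadata of an ingested dataset."""
    name: str
    description: str
    tags: List[Tag] = field(default_factory=list)


@dataclass
class Ingest:
    """Metadata of one ingestion run (attribute names are hypothetical)."""
    source_code_uri: str       # where the ingestion program is stored
    user: str                  # who launched the ingestion
    source_name: str           # upstream (external) dataset
    target: DatalakeDataset    # downstream dataset stored in the lake
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def find_by_tag(datasets: List[DatalakeDataset], tag: str) -> List[DatalakeDataset]:
    """Return the datasets carrying the given tag (findability)."""
    return [d for d in datasets if any(t.name == tag for t in d.tags)]


humidity = DatalakeDataset("room_humidity", "Humidity readings", [Tag("humidity")])
weather = DatalakeDataset("weather_csv", "Weather station files", [Tag("weather")])
run = Ingest("swift://processes/ingest.py", "alice", "weather_source", weather)
```

Searching by tag then reduces to a filter over the stored dataset descriptions, and an Ingest record links the external source to the dataset it produced.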


3.2.2 Metadata for Data Processes. To ensure that users can find
how data are processed and stored in the data lake, our metadata
model includes information on data processes (see green classes in
Fig. 2). With the process metadata, users can know (i) process
characteristics, the basic information about a process, for
instance who did what and when, (ii) the process definition
(attribute Process.description), which explains the context,
meaning and objective of a process, (iii) technical information
about the source code and execution, which is useful when users
want to know how processes are deployed or when they want to modify
or reuse a process, and (iv) process content, which consists of
coarse-grained transformation operators used to qualify data
processing [8].

Data process metadata are instantiated for both batch and
(near) real-time processes. The only difference between them is that
the source dataset of a batch process is a dataset already ingested
in the data lake (class DatalakeDataset), while the source dataset of
a real-time process is an external dataset (class DatasetSource).


3.2.3 Metadata for Data Analysis. To facilitate data analytics, we
also consider the metadata of analyses (see blue classes in Fig. 2)
to help users understand the nature of datasets and find existing
analyses with their models, outputs and evaluations, so that users
can choose the most appropriate way to analyze data more
efficiently [14]. The analysis metadata apply to both batch
(classes Analysis and Study) and real-time (classes RealtimeAnalysis
and Study) analyses. For batch analyses, our model details the
metadata useful for machine learning (classes Implementation,
Software, Parameter, ParameterSetting, Landmarker, Algorithm,
OutputModel, ModelEvaluation, EvaluationMeasure,
EvaluationSpecification, Task).


4 A ZONE-BASED DATA LAKE: A
TECHNICAL ARCHITECTURE


We present our technical data lake architecture in Figure 3. In the 
following we describe the technical implementation of each zone. 


Raw Data Zone. For the raw data storage zone, we decided to
implement Openstack Swift (OSS). OSS is an object storage system
which allows any type of data to be stored at low cost, the only
consideration being the size of the data. OSS lets us respond to
volume, velocity and variety because it stores data independently
of their content and structure, which allows raw ingestion of any
data. OSS is part of the Openstack environment, which brings native
compatibility with Openstack tools and a large community centered
around open source. Communications are done through a RESTful API,
a standard that provides interconnectivity and ease of use. OSS
allows linear and elastic scaling of the raw data zone.


Figure 1: IoT data lake functional architecture
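
As an illustration of raw ingestion through the RESTful API, the sketch below prepares (without sending) the HTTP request that stores one object. It assumes the usual Swift object path `/v1/{account}/{container}/{object}` and the `X-Auth-Token` header; the endpoint, account, container and object names are made up for the example.

```python
import urllib.request
from urllib.parse import quote


def build_swift_put(base_url: str, account: str, container: str,
                    object_name: str, payload: bytes,
                    token: str) -> urllib.request.Request:
    """Prepare (but do not send) a PUT request that stores one raw object."""
    url = "{}/v1/{}/{}/{}".format(
        base_url.rstrip("/"), quote(account), quote(container), quote(object_name)
    )
    headers = {
        "X-Auth-Token": token,                       # token issued by the identity service
        "Content-Type": "application/octet-stream",  # stored as-is, whatever the format
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="PUT")


request = build_swift_put(
    "https://swift.example.org",      # hypothetical endpoint
    "AUTH_demo", "raw-sensors", "reading-0001.json",
    b'{"value": 42.1}', "dummy-token",
)
```

Because the payload is sent as opaque bytes, the same call serves a sensor message, a CSV file or an image, which is what makes the zone independent of content and structure.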


Process Zone. The main requirement of the process zone is to
allow users to prepare their data and store intermediate results
and processes. We chose Apache Airflow because it allows us to
rigorously manage the various transformation workflows applied to
the data. This open-source workflow scheduling tool makes it simple
to organize and schedule processes and to implement an organized
processing chain, while remaining very easy to use. It supports a
wide variety of tools and programming languages for defining
transformation functions (for ETL), using standard tools from the
data analyst's toolbox and calling large-scale data processing
frameworks such as Apache Spark. All the processes implemented in
this zone are also stored in OSS so that they can easily be reused.


Access zone. Once the data are formatted, they can be consumed
in the access zone. To resolve the interoperability problem with
the applications downstream of the data lake, we decided to design
this zone as a multi-store that can host a wide variety of tools
through containerization (for instance Docker). This includes all
databases, ETL tools, reporting tools and many private or custom
tools. Combining this approach with containerization tools makes it
possible to instantiate these tools on the fly, offering great
flexibility.


Govern Zone. This zone is composed of two sub-zones : (i) meta- 
data management sub-zone and (ii) security and monitoring sub- 
zone. 


• The metadata management sub-zone. Metadata, defined with
the model described in Section 3, are stored in a dedicated
data management system. They cover raw data, transformed
data and processes. The metadata system relies on a NoSQL
graph-oriented database, which has several advantages:
(i) A graph database is flexible. The general metadata model
presented in Section 3 is adapted to different types of
datasets for batch and real-time analyses, and users can
always extend the model to fit specific requirements.
(ii) A graph database provides a powerful manipulation tool
which includes not only a standard query language but also
integrated machine learning functions. Moreover, a graph
database is easier to query when the search depth is large.
Metadata are updated at several steps in the data life cycle:
(i) initial metadata are inserted when data are ingested and
(ii) further metadata are added with each operation performed
on the data, to follow their transformations.
• The security and monitoring sub-zone. From the design stage,
we integrate security mechanisms, authentication and
monitoring of the services, giving users control over the
data lake architecture and its resources.
- For security, we address two issues: first identifying
users, then controlling their authentication. For
identification, we decided to implement Openstack
Keystone. This service keeps a list of user accounts with
token-based authentication (UUID token or Fernet token).
Fernet tokens use AES encryption and HMAC SHA256
signature verification, which provides secure
identification and trustworthy authentication. This is
implemented at the entry of the architecture, in the
RESTful API, but it can also be integrated in the pipeline
of each service involved by verifying the validity of the
token before any operation. Authentication requires
multi-level access control to limit the number of entry
points in the architecture and the number of tools used.
All accesses go through the REST API and a reverse proxy.
One of the advantages of Openstack Keystone is its
compatibility with widely used identification services
(NIS, LDAP) and/or authentication services (Kerberos),
either natively or by customizing the service. It is thus
possible to integrate this authentication system with
pre-established systems.

- Monitoring is handled at several levels. As it stands,
three different levels of monitoring need to be
implemented. The lowest level is the system and network
level. The hosting platform determines more precisely the
needs and tools to be implemented. Tools such as SNMP,
Prometheus and/or Kubernetes monitoring tools are
considered to meet security needs. The intermediate level
is application monitoring. It is necessary to track the
resource consumption of each service in the architecture
in order to scale the platform as well as possible.
Openstack Ceilometer can be a partial solution to this
problem. The highest level monitors, for each user, the
resources and services used. This can be used to track
meta-information on projects and to obtain a higher-level
view, but also to allow billing of services when this
architecture is provisioned as a service in a private
context.
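
The token-based flow described for the security sub-zone can be sketched as follows. The request body follows the password method of the Keystone Identity API v3 (sent with POST to `/v3/auth/tokens`, the token being returned in the `X-Subject-Token` response header); the user name, password and helper names are placeholders, and nothing here is taken from the project's own code.

```python
import json


def build_token_request(user: str, password: str, domain: str = "Default") -> dict:
    """Body of a password-based token request (Keystone Identity API v3)."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user,
                        "domain": {"name": domain},
                        "password": password,
                    }
                },
            }
        }
    }


def authorized_headers(token: str) -> dict:
    """Headers attached to every later call so that each service can
    check the token before performing any operation."""
    return {"X-Auth-Token": token}


body = json.dumps(build_token_request("alice", "secret"))   # placeholder credentials
headers = authorized_headers("token-from-keystone")          # placeholder token
```

Keeping token issuance at the single entry point and token validation in each service matches the idea of limiting entry points while still checking every operation.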


Discussion on Hadoop. When it comes to data lakes, the most
popular solution is the Hadoop ecosystem. It has been maintained
for many years by the Apache foundation, with a large community,
which makes it appropriate for many use cases. Hadoop was designed
to store and process very large files through massive
parallelization. However, even if it copes well with volume, it
does not seem suited to the other aspects of big data. The main
problem lies in the variety of data, as we encounter it with IoT
data and small data. This problem comes from the design of HDFS,
the file system on which the entire ecosystem is based. With HDFS,
data are divided into blocks to allow for data replication. These
blocks are by default 64 or 128 MB in size and are stored on
DataNodes¹. HDFS keeps about 150 bytes of metadata in RAM for each
stored object; this bookkeeping is held by the NameNode. The
problem arises when the data are much smaller than the block size:
the NameNode's memory fills up long before the DataNodes' storage
space does. In most cases, a sensor reading is about 100 bytes. It
is difficult to implement a message aggregation policy that would
circumvent this problem while remaining efficient and lightweight
and without increasing the complexity of the operations performed
on these data. Our context requires the management of a wide
variety of data, from high-volume batch data and high-velocity
stream data to IoT and small data. Data security management and
simplicity of deployment are also important requirements. We
therefore preferred the solution described above to the Hadoop
ecosystem.
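
The small-data argument can be checked with a few lines of arithmetic. The sketch below uses the figures quoted in the text (about 150 bytes of in-memory bookkeeping per stored object, about 100 bytes per sensor reading, 128 MB blocks) and the monthly message count given in Section 5.2. It assumes, as a simplification, one tracked object per stored file.

```python
BYTES_PER_OBJECT = 150          # approximate in-memory metadata per tracked object
READING_SIZE = 100              # approximate size of one sensor reading, in bytes
BLOCK_SIZE = 128 * 1024 ** 2    # 128 MB block


def metadata_bytes(n_readings: int, readings_per_file: int = 1) -> int:
    """In-memory bookkeeping when readings are grouped into files."""
    n_files = -(-n_readings // readings_per_file)   # ceiling division
    return n_files * BYTES_PER_OBJECT


monthly = 2_000_000                                  # readings per month
payload = monthly * READING_SIZE                     # bytes actually stored
one_per_file = metadata_bytes(monthly)               # one file per reading
packed = metadata_bytes(monthly, BLOCK_SIZE // READING_SIZE)  # block-sized files
```

Under these assumptions, storing each reading as its own file costs more memory for bookkeeping (300 MB per month) than the readings occupy on disk (200 MB per month), whereas perfectly packed block-sized files would need only a few hundred bytes. The gap is what an aggregation policy would have to recover, at the price of extra processing complexity.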


5 EXPERIMENTAL VALIDATION 


To validate our solution, we have implemented it for a real project,
NeOCampus. In this section, we first introduce the context and
details of the project and then present some results of our
implementation.


5.1 Context of the Experimentation 


In order to validate our proposal, the data lake architecture² was
implemented as a proof of concept (POC) for the NeOCampus
project. A specific branch has been created on the repository to
track the source code used in the experiment³. NeOCampus⁴ is a
multidisciplinary project gathering many French laboratories with
a range of skills, from computer science to social science. The
project is centered around ambient systems and provides a testing
ground for building an innovative university campus. The ambient
systems use many sensors, which form a sensor network. The
management of these real-time IoT data is an important issue for
the NeOCampus project. As the project involves several researchers,
data management also concerns batch data coming from research
projects. Our architecture allows the NeOCampus project to satisfy
several objectives, such as enabling the reuse of sensor data for
research purposes, facilitating brainstorming between several user
profiles, encouraging the reuse of processes developed in different
projects through the metadata system, cross-referencing data from
several sensors with batch data in a simplified way, creating rich
dashboards and so on. This experiment was run on the OSIRIM
platform with 4 virtual machines from a VMware virtualization
server. A total of 20 vCPUs (1 thread per core, 2.6 GHz each) and
32 gigabytes of RAM were allocated.


¹ https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
² https://gitlab.irit.fr/datalake/docker_datalake
³ https://gitlab.irit.fr/datalake/docker_datalake/-/tree/use_case_paper_IDEAS
⁴ https://www.irit.fr/neocampus/fr/


5.2 Sources of Data 
For this POC, we considered two categories of data: 


• Streaming data come from the sensor network of the
NeOCampus project. In particular, two rooms of Paul Sabatier
University are equipped. The deployed sensors are:
(i) luminosity sensors (in lux), (ii) hygrometry sensors (in
% r.h.), (iii) temperature sensors (in degrees Celsius),
(iv) CO2 sensors (in ppm) and (v) energy sensors (complex
measurements with several values). The IoT data arrive at
regular intervals, on the order of minutes, which leads to a
continuous flow of 45 messages per minute, corresponding to
105,000 messages in 2 days or 2,000,000 messages per month.
Sensor data are collected from the MQTT broker of the
NeOCampus IoT network as soon as they arrive: a process
subscribes to all topics on the broker to receive every
sensor reading as it is created. Figure 4 shows an example of
the MQTT topic identification for a humidity sensor.

• Batch data are research datasets such as the open data
of "Météo-France", which contain data collected by about
fifty meteorological stations across the French territory.
These data are updated every 3 hours and can be downloaded
as CSV files for a defined period (3 hours or a whole month).
Each CSV file contains 41 columns such as temperature,
hygrometry, atmospheric pressure, wind speed, cloud types and
their altitude, or the type of barometric trend. The data are
time series indexed by time and station number.


5.3 Data processing 


Regarding data processing, we implemented different processing
pipelines in Airflow, one for each type of data (see Figure 5).
The objective of the processes presented in Figure 5 is to format
the data in order to insert them into a time-series oriented
database, InfluxDB.

The pipeline is chosen according to the name of the data
container. In fact, the different pipelines share similar
operations, except for the data parsing method. For real-time data,
data formatting is applied to each message received from the
sensors. For batch data, such as meteorological data files, data
formatting is applied to each row of the CSV files during the
ingestion phase.
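
The selection of a pipeline by container name, with only the parsing step changing, can be sketched in plain Python as below. This is an illustration of the principle rather than the Airflow implementation: the container names, message fields, CSV column names and the shape of the output points are all assumptions.

```python
import csv
import io
import json
from typing import Callable, Dict, List


def parse_sensor_message(raw: str) -> List[dict]:
    """Real-time case: one JSON message from a sensor becomes one point."""
    msg = json.loads(raw)
    return [{"measurement": msg["unit"], "time": msg["time"],
             "value": float(msg["value"])}]


def parse_weather_csv(raw: str) -> List[dict]:
    """Batch case: each row of the CSV file becomes one point."""
    rows = csv.DictReader(io.StringIO(raw), delimiter=";")
    return [{"measurement": "temperature", "time": r["date"],
             "value": float(r["t"])} for r in rows]


# Hypothetical container names mapped to their parsing method.
PARSERS: Dict[str, Callable[[str], List[dict]]] = {
    "neocampus-sensors": parse_sensor_message,
    "meteo-france": parse_weather_csv,
}


def run_pipeline(container: str, raw: str) -> List[dict]:
    """Pick the parser from the container name, then apply the shared steps."""
    points = PARSERS[container](raw)
    # Step shared by every pipeline: drop points that carry no timestamp.
    return [p for p in points if p["time"]]


sensor_points = run_pipeline(
    "neocampus-sensors",
    '{"unit": "lux", "time": "2021-03-01T10:00:00Z", "value": "312"}',
)
weather_points = run_pipeline(
    "meteo-france",
    "date;t\n2021-03-01T09:00:00Z;281.5\n;280.0\n",
)
```

Adding a new kind of source then only requires a new parsing function and one entry in the mapping.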


5.4 Data Access Use Case 


In this section, we present a possible use case for the data
ingested in our data lake architecture: a monitoring dashboard for
the different buildings of the university. It lets us follow the
systems and infrastructures of these buildings. The sensor data can
be used to monitor ambient conditions in the rooms, while the
meteorological data collected by Météo-France can be followed at
the same time. By crossing both sources, it is possible to
investigate possible correlations between them. Moreover,
malfunctions or failures of electrical or heating systems can be
detected quickly. Comparing these data can, for instance, help
distinguish a possible flood from an open window in an equipped
room. Figure 6 shows the results of this use case, in which we use
InfluxDB as a web-based graphical visualization and dashboard
creation tool. Thanks to its native data refresh system, this tool
lets us visualize and follow the evolution of the data in real
time. From the same dataset, several dashboards can be set up for
different purposes. With InfluxDB, it is also possible to define
aggregation functions on the fly and to run processing directly in
the dashboard without modifying the data. Unit conversions can be
performed, as in the example. It would also be possible to apply
more complex statistical analysis functions and to implement tools
from the field of statistical process control, raising alerts by
means of control charts.
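
As a hedged illustration of the control-chart idea mentioned above, the following sketch computes classic three-sigma limits from a baseline and flags readings that fall outside them. It is a generic Shewhart-style check written for this example, with made-up temperature values, and is not part of the deployed dashboard.

```python
from statistics import mean, stdev
from typing import List, Tuple


def control_limits(baseline: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits: mean -/+ k sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def out_of_control(values: List[float], limits: Tuple[float, float]) -> List[int]:
    """Indices of the readings that fall outside the control limits."""
    low, high = limits
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up indoor temperatures (degrees Celsius) used as the in-control baseline.
baseline = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
limits = control_limits(baseline)
alerts = out_of_control([20.0, 20.4, 27.5, 19.9], limits)
```

Here only the third new reading (27.5) lies outside the limits and would raise an alert, for example for a heating malfunction.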


5.5 Metadata Management System 


To facilitate the search and use of metadata, we have implemented
a metadata management web application. The application has an
ergonomic, user-oriented interface through which users can easily
find existing datasets, processes and analyses, and thus
information that may be useful for data analytics. The front-end of
the application is developed with HTML/CSS and JavaScript. The
back-end relies on the NeoViz API to visualize the metadata stored
in a Neo4j database.
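
The kind of lineage search that such a metadata graph supports can be pictured with a small in-memory example. The sketch below is not a Neo4j query; it only mimics, with a Python dictionary and invented node names, how the ancestors of a dataset can be collected by walking lineage edges backwards to any depth.

```python
from collections import deque
from typing import Dict, List

# Edges point from an element to what was derived from it (names are made up).
LINEAGE: Dict[str, List[str]] = {
    "source:humidity_sensor": ["process:realtime_format"],
    "process:realtime_format": ["dataset:humidity"],
    "source:weather_files": ["dataset:weather_csv"],
}


def upstream(target: str, graph: Dict[str, List[str]]) -> List[str]:
    """Walk the lineage backwards from target, nearest ancestors first."""
    parents: Dict[str, List[str]] = {}
    for node, children in graph.items():
        for child in children:
            parents.setdefault(child, []).append(node)
    seen, order, queue = {target}, [], deque([target])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


realtime_lineage = upstream("dataset:humidity", LINEAGE)
batch_lineage = upstream("dataset:weather_csv", LINEAGE)
```

A real-time dataset shows a process between the source and the stored dataset, whereas a batch dataset is linked directly to its source.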

To illustrate the application, we ingested real-time data from
connected objects and batch datasets (CSV files), which are the two
types of data used for NeOCampus. With the application, users can
easily search existing datasets, processes or analyses by typing
tags. Moreover, through the interface, users can discover different
aspects of information, such as basic element characteristics,
lineage information and relationships among datasets. For example,
a user who searches for humidity datasets finds that dataset 190
concerns humidity (see top image in Fig. 7). To learn how the
dataset was ingested, the user can switch to the Lineage tab, which
shows that a dataset source passed through a real-time process and
was then ingested into dataset 190. The Lineage tab is also
available for the batch data life cycle: for a batch dataset, the
user can see that a data source is ingested directly into the data
lake without any real-time processing (see bottom image in Fig. 7).


Figure 5: Different workflow pipelines visualization

Figure 6: Indoors and Outdoors Monitoring

Figure 7: Searching results in the metadata management web application


6 CONCLUSIONS


This paper falls within the context of big data analytics with
varied data: batch data with large volume, stream data with high
velocity and IoT sensor readings with great variety. Our solution
addresses the issues raised by the new traffic created by connected
objects while maintaining the primary applications of big data
analytics. It does so from a theoretical point of view, with a
functional architecture in four zones enriched with a metamodel
that adapts to any type of data, as well as from a technical point
of view, with an architecture designed to take advantage of the
genericity of object storage. Our solution adapts to any type of
data, in terms of volume or type. Beyond the ability to manage any
type of data, it becomes possible to cross data and address the
challenge of silo architecture and development in IoT.

We have implemented our solution through a concrete case (the
NeOCampus operation) allowing the management of (i) data flows
(45 messages per minute) and (ii) datasets containing data from
another context, covering about fifty weather stations across the
whole French territory. Future work on this architecture will focus
on three points: (i) security and the use of virtualization for
multi-level user space separation and automatic deployment,
(ii) working and processing tools, including the integration of
tools for the analysis of data streams with temporal constraints,
and (iii) a strong integration of data semantics through metadata
management.


REFERENCES 


[1] Ayman Alserafi, Alberto Abello, Oscar Romero, and Toon Calders. 2016. Towards 
information profiling: data lake content metadata management. In 2016 IEEE 16th 
International Conference on Data Mining Workshops (ICDMW). IEEE, 178-185. 

[2] Fabian Constante Nicolalde, Fernando Silva, Boris Herrera, and Antonio Pereira. 
2018. Big Data Analytics in IOT: Challenges, Open Research Issues and Tools. 
In Trends and Advances in Information Systems and Technologies, Alvaro Rocha, 
Hojjat Adeli, Luis Paulo Reis, and Sandra Costanzo (Eds.). Springer International 
Publishing, Cham, 775-788. 

[3] James Dixon. 2010. Pentaho, Hadoop, and Data Lakes.
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/

[4] Mouzhi Ge, Hind Bangui, and Barbora Buhnova. 2018. Big data for internet of 
things: a survey. Future generation computer systems 87 (2018), 601-614. 

[5] Taiwo Kolajo, Olawande Daramola, and Ayodele Adebiyi. 2019. Big data stream 
analysis: a systematic literature review. Journal of Big Data 6, 1 (2019), 1-30. 

[6] Ruoran Liu, Haruna Isah, and Farhana Zulkernine. 2020. A Big Data Lake for
Multilevel Streaming Analytics. In 2020 1st International Conference on Big Data
Analytics and Practices (IBDAP). IEEE, 1-6.

[7] Mohsen Marjani, Fariza Nasaruddin, Abdullah Gani, Ahmad Karim, Ibrahim
Abaker Targio Hashem, Aisha Siddiqa, and Ibrar Yaqoob. 2017. Big IoT Data
Analytics: Architecture, Opportunities, and Open Research Challenges. IEEE
Access 5 (2017), 5247-5261.
http://dblp.uni-trier.de/db/journals/access/access5.html#MarjaniNGKHSY17

[8] Imen Megdiche, Franck Ravat, and Yan Zhao. 2021. Metadata Management on
Data Processing in Data Lakes. In SOFSEM 2021: Theory and Practice of Computer
Science - 47th International Conference on Current Trends in Theory and Practice
of Computer Science, SOFSEM 2021, Bolzano-Bozen, Italy, January 25-29, 2021,
Proceedings (Lecture Notes in Computer Science, Vol. 12607). Springer, 553-562.

[9] Hassan Mehmood, Ekaterina Gilman, Marta Cortes, Panos Kostakos, Andrew
Byrne, Katerina Valta, Stavros Tekes, and Jukka Riekki. 2019. Implementing big
data lake for heterogeneous data sources. In 2019 IEEE 35th International
Conference on Data Engineering Workshops (ICDEW). IEEE, 37-44.

[10] Amr A. Munshi and Yasser Abdel-Rady I. Mohamed. 2018. Data Lake Lambda
Architecture for Smart Grids Big Data Analytics. IEEE Access 6 (2018),
40463-40471. https://doi.org/10.1109/ACCESS.2018.2858256

[11] Franck Ravat and Yan Zhao. 2019. Data Lakes: Trends and Perspectives. In
Database and Expert Systems Applications - 30th International Conference, DEXA,
Lecture Notes in Computer Science. Springer International Publishing, 304-313.

[12] Victoria Rubin and Tatiana Lukoianova. 2013. Veracity roadmap: Is big data
objective, truthful and credible? Advances in Classification Research Online 24, 1
(2013), 4.

[13] Shabnam Shadroo and Amir Masoud Rahmani. 2018. Systematic survey of big
data and data mining in internet of things. Computer Networks 139 (2018), 19-47.

[14] Yan Zhao, Imen Megdiche, and Franck Ravat. 2021. Analysis-oriented
Metadata for Data Lakes. In 25th International Database Engineering Applications
Symposium (IDEAS 2021).




COVID-19 Concerns in US: Topic Detection in Twitter 


Carmela Comito 
Nat. Research Council of Italy (CNR) 
Institute for High Performance Computing and Networking (ICAR) 
Rende, Italy 
carmela.comito@icar.cnr.it 


ABSTRACT 


The COVID-19 pandemic is affecting the lives of citizens worldwide.
Epidemiologists, policy makers and clinicians need to understand
public concerns and sentiment to make informed decisions and
adopt preventive and corrective measures to avoid critical
situations. In the last few years, social media have become a tool
for spreading news and for discussing ideas and comments on world
events. In this context, social media play a key role, since they
represent one of the main sources from which to extract insight
into public opinion and sentiment. In particular, Twitter has
already been recognized as an important source of health-related
information, given the amount of news, opinions and information
shared by both citizens and official sources. However, identifying
interesting and useful content in large and noisy text streams is a
challenging issue. The study proposed in this paper aims to extract
insight from Twitter by detecting the most discussed topics
regarding COVID-19. The proposed approach combines peak detection
and clustering techniques. Tweet features are first modeled as time
series. Peaks are then detected in the time series, and peaks of
textual features are clustered based on their co-occurrence in the
tweets. Results obtained over real-world datasets of tweets related
to COVID-19 in the US show that the proposed approach is able to
accurately detect several relevant topics of interest, spanning
from health status and symptoms to government policy, economic
crisis, COVID-19-related updates, prevention, vaccines and
treatments.


1 INTRODUCTION


Twitter has long been used by the research community as a means 
to understand dynamics observable in online social networks, from 
information dissemination to the prevalence and influence of bots 
and misinformation. More importantly during the current COVID- 
19 pandemic, Twitter provides researchers the ability to study the 
role social media plays in the global health crisis. 

Since the early days of COVID-19, several papers exploiting social
media data to address pandemic-related issues have been proposed.
The research activities have focused on different topics: from
the analysis of people's reactions and the spread of COVID-19
[1, 15, 19, 20], to the search for conspiracy theories [9, 21] and
the identification of misinformation propagation [11, 23]. A
number of research works are also devoted to the detection
of trending topics of discussion about COVID-19 [1, 7, 10, 16, 24].

The approaches in the literature that aim at detecting COVID-19
topics from social media use the well-known LDA topic modeling
algorithm. However, traditional approaches to topic detection like
LDA are not effective in a dynamic context like social media [3],
where the word vocabulary changes over time. In LDA, by contrast,
the data structures used to represent the textual content of tweets
are fixed in advance, as are the size of the word vocabulary, the
set of terms used and the number of topics produced.

This paper describes an alternative method able to detect
COVID-19 topics in social media conversations by exploiting the
spatio-temporal features of geo-tagged posts. Both numeric
and textual features are extracted from the posts, and their
evolution is monitored over time to identify bursts in their
values. The proposed method is a two-phase approach that first
detects peaks in the time series associated with the
spatio-temporal features of the tweets and then clusters the
textual features (either hashtags or words) exhibiting peaks within
the same timestamp. The clustering is based on the co-occurrence of
the textual features in the tweets. Moreover, by clustering the
textual features, we separate groups of words and topics that are
perceived as more relevant to the COVID-19 debate. Debates range
from comparisons with other viruses, health status and symptoms, to
government policy and the economic crisis, while the largest volume
of interaction is related to the lockdown and the other
countermeasures and restrictions adopted to fight the pandemic.
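
To give an intuition of the two phases, the sketch below pairs a very simple peak detector with a co-occurrence grouping. It is only an illustration under stated assumptions: the thresholding rule (mean plus k standard deviations of a preceding window), the minimum number of shared tweets and the toy data are invented for this example and do not reproduce the algorithm formulated later in the paper.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, List, Set


def peaks(series: List[float], window: int = 3, k: float = 2.0) -> List[int]:
    """Indices whose value exceeds mean + k * std of the preceding window."""
    found = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        if series[i] > mean(past) + k * pstdev(past):
            found.append(i)
    return found


def cluster_by_cooccurrence(tweets: List[Set[str]], features: Set[str],
                            min_shared: int = 2) -> List[Set[str]]:
    """Group features that appear together in at least min_shared tweets."""
    groups: Dict[str, Set[str]] = {f: {f} for f in features}
    for a, b in combinations(sorted(features), 2):
        shared = sum(1 for t in tweets if a in t and b in t)
        if shared >= min_shared and groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for f in merged:
                groups[f] = merged
    unique: List[Set[str]] = []
    for g in groups.values():
        if g not in unique:
            unique.append(g)
    return unique


# Toy per-slot counts of three textual features (made-up numbers).
counts = {"lockdown": [2, 3, 2, 12], "mask": [1, 1, 2, 9], "cat": [5, 5, 5, 5]}
slot = 3
peaking = {f for f, s in counts.items() if slot in peaks(s)}
tweets = [{"lockdown", "mask"}, {"lockdown", "mask", "home"}, {"cat"}]
clusters = cluster_by_cooccurrence(tweets, peaking)
```

In this toy case, "lockdown" and "mask" both burst in the last time slot and co-occur in two tweets, so they end up in the same cluster, while the flat feature "cat" is never considered.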

The rest of the paper is organized as follows. Section 2
overviews related work. Section 3 describes the target real-world
dataset of tweets posted in the US. Section 4 formulates the
problem, introducing the key aspects of the approach and the topic
detection algorithm. The results of the evaluation performed over
the real-world datasets of tweets are shown in Section 5. Section 6
concludes the paper.


2 RELATED WORK


Since February 2020, an increasing number of studies focusing on
the analysis of social media data related to the COVID-19 pandemic
have been proposed. The majority of such works collect
large-scale datasets and share them publicly to enable further
research. Some datasets are restricted to a single language such as
English [12, 13], while others contain multiple languages
[4, 7]. The datasets also differ in their collection periods.
Among them, [4] stands out as one of the longest-running
collections, with the largest number of tweets (i.e., 250 million),
thanks also to its multilingual nature. However, these datasets
mostly contain only the raw content obtained from Twitter, except
for [12, 13], which associate a sentiment score with each tweet.
The dataset in [18] enriches the raw tweet content with additional
geolocation information. With more than 524 million tweets, the
dataset is more than twice as large as the largest dataset
available to date.

Another relevant group of works is devoted to the analysis
of human behavior and reactions to the spread of COVID-19
[1, 15, 19, 20]; others focus on the detection of conspiracy
theories and social activism [9, 21]. Approaches to the propagation
and quantification of misinformation related to COVID-19 on Twitter
are proposed in [11, 23].

In recent years there has been a significant research effort on
detecting topics and events in social media. This research line has
also received great attention for studying COVID-19 data on social
media. Among the techniques defined for traditional data, and often
adapted for Twitter data, as reported in [3], the most
representative method is Latent Dirichlet Allocation (LDA) [6]. LDA
is a topic model that relates words and documents through latent
topics. It associates with each document a probability distribution
over topics, which are in turn distributions over words. A document
is represented by a set of terms, which constitute the observed
variables of the model. One of the main drawbacks of LDA is that the
data structures used to represent the textual content of tweets must
be fixed in advance, namely the size of the word vocabulary, the set
of terms used, and the expected number of topics.
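For reference, the generative process usually associated with LDA can be written as below. This is the standard textbook formulation rather than anything specific to this paper; the symbols K (number of topics), D (number of documents), N_d (length of document d), and the Dirichlet hyperparameters alpha and beta are notation introduced here for illustration.

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),  & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d), \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}}), & n &= 1,\dots,N_d
\end{align*}
```

Here theta_d is the per-document distribution over topics and phi_k the per-topic distribution over words; only the words w_{d,n} are observed. Both K and the vocabulary over which each phi_k is defined have to be chosen before fitting, which is the limitation discussed above.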

Direct application of traditional approaches to topic detection
like LDA on Twitter streams, as pointed out in [3], may give
low-quality results; thus, many different methods have been proposed
[2, 5, 25, 26]. Generally, the common approach is to extract
different data features from social media streams and then summarize
such features for topic/event detection tasks by exploiting the
information over content, temporal, and social dimensions. The
approach proposed in this paper follows this view by introducing a
set of spatio-temporal features that summarize tweet volumes, user
dynamics, and location popularity, and a set of textual features
that summarize tweet content. Specifically, the main difference
between the proposed approach and the above works lies in the
summarization technique. In [2, 5, 25, 26], the textual content of
tweets is represented with a traditional vector-space model, where
the vector dimension is fixed in advance as the size of the word
vocabulary. In streaming scenarios, because the word vocabulary
changes dynamically over time, it is computationally very expensive
to recalibrate the inverse document frequency of TF-IDF. In
contrast, we maintain an evolving signature structure for each
cluster by either updating the frequencies of terms already present
or including new terms; since we refer to term frequencies as
relative values, there is no need to recalibrate the vocabulary
size.
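The idea of a signature that grows with the stream can be illustrated with a minimal sketch. The class below is hypothetical (the name `ClusterSignature` and its methods are not taken from the paper) and assumes that a signature is simply a map from terms to counts, queried as relative frequencies:

```python
from collections import Counter


class ClusterSignature:
    """Evolving term signature of a cluster (illustrative, not the paper's code)."""

    def __init__(self):
        self.counts = Counter()  # absolute term counts seen so far
        self.total = 0           # total number of term occurrences

    def update(self, terms):
        """Add the terms of a new tweet: known terms are incremented, new ones inserted."""
        self.counts.update(terms)
        self.total += len(terms)

    def relative_frequency(self, term):
        """Frequency of `term` relative to all occurrences; 0.0 if nothing was seen."""
        if self.total == 0:
            return 0.0
        return self.counts[term] / self.total


sig = ClusterSignature()
sig.update(["mask", "shortage", "mask"])
sig.update(["vaccine", "mask"])           # "vaccine" is a term not seen before
print(sig.relative_frequency("mask"))     # 3 of 5 occurrences
print(sig.relative_frequency("vaccine"))  # 1 of 5 occurrences
```

Because frequencies are normalized at query time, inserting a previously unseen term does not require resizing a vector or recomputing weights for the terms already stored.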

To the best of our knowledge, all the works proposed so far to
detect COVID-19 topics from social media adopt the LDA approach;
thus, all of them suffer from the limitations just cited. The most
relevant drawback is that the number of topics, as well as the
structure and size of the data, must be fixed in advance.
Nevertheless, to give a complete view of what has been done to
identify COVID-19 related topics from social media, the current
state of the art is briefly surveyed below.

In [16], the authors use Twitter data to explore and illustrate
five different methods to analyze the topics, key terms and
features, information dissemination and propagation, and network
behavior during COVID-19. The authors use pattern matching and topic
modeling with LDA to select twenty different topics on the spread of
corona cases, healthcare workers, and personal protective equipment.
Using various analyses, the authors were able to detect only a few
high-level topic trends. Alrazaq et al. [1] also performed topic
modeling using word frequencies and LDA, with the aim of identifying
the primary topics shared in the tweets related to COVID-19.

Chen et al. [7] analyzed the frequency of 22 different keywords,
such as "Coronavirus", "Corona", and "Wuhan", across 50 million
tweets from January 22, 2020 to March 16, 2020. Thelwall [24] also
published an analysis of topics for English-language tweets during
the period March 10-29, 2020. Singh et al. [23] analyzed the
distribution of languages and the propagation of myths.

Sharma et al. [22] implemented sentiment modeling to understand
the public perception of intervention policies such as
"socialdistancing" and "workfromhome". They also tracked topics,
emerging hashtags, and sentiments across countries. Again, for topic
detection they used LDA.

Cinelli et al. [8] compared Twitter against other social media
platforms (Instagram, YouTube, Reddit and Gab) to model the spread
of information about COVID-19. They analyzed engagement and interest
in the COVID-19 topic and provided a differential assessment of the
evolution of the discourse on a global scale for each platform and
its users. They fit information spreading with epidemic models,
characterizing the basic reproduction number $R_0$ for each social
media platform. Also in this case, the authors applied the standard
LDA method for topic detection.

CoronaVis, described in [10], is a web application for tracking,
collecting, and analyzing tweets related to COVID-19 generated from
the USA. The tool makes it possible to visualize topic modeling,
analyze user movement information, study subjectivity, and model
human emotions during the COVID-19 pandemic. These analyses are
updated in real time. The authors also share a cleaned and processed
dataset, named the CoronaVis Twitter dataset (focused on the United
States), available to the research community at
https://github.com/mykabir/COVID19. Also in this work, the authors
used LDA for topic modeling, together with the Python pyLDAvis
package to find the relevant topics and produce an interactive
visualization.


3 DATASETS 


Real-world datasets of tweets posted in the United States have been
used for the experimental evaluation. The choice to use tweets from
the US is mainly motivated by the large availability of data, which
makes meaningful data analytics possible. In fact, the United States
has the most geotagged tweet mentions of COVID-19, followed by the
United Kingdom.

Table 1: Keywords used to filter COVID-19 tweets.

Keywords

corona        pandemic
breathing     lockdown
outbreak      distancing
deadth        quarentine
covid-19      pneumonia
virus         drops
Sars-cov-2    stayhome
China         mask
Wuhan         positive

Specifically, the dataset analysed is the one collected within the
CoronaVis project [10] and accessible from the GitHub repository
(https://github.com/mykabir/COVID19). Data were collected starting
from March 5, 2020 using the Twitter Streaming API and Tweepy, and
consist of over 200 million tweets related to COVID-19, which is
about 1.3 terabytes of raw data saved as JSON files. The tweets
have been posted by 30,070 unique users. Each record has the form
(TweetID, TweetText, UserLocation, UserType). Some of the keywords
used to filter COVID-19 related tweets are shown in Table 1.


Figure 1 depicts the tweet frequency over time. The graph shows that
the daily tweet frequency is overall rather high but varies
considerably from one day to another, ranging from 25,000 to
200,000, with some relevant peaks, e.g., on March 9, at the end of
April, in mid May, on June 21, and at the end of June. The
variability in the tweet frequency is also due to some gaps in the
collected datasets caused by API and connectivity issues. Hence, on
some dates the authors of [10] fetched a fairly low number of tweets
compared to the original number of tweets posted on that day. To
fill those gaps we used another publicly available dataset, the
GeoCOV19Tweets dataset [12].

The GeoCOV19Tweets dataset [12, 14, 18] consists of 675,104,398
tweets (available from the web site
https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset)
and contains IDs and sentiment scores of the geo-tagged tweets
related to the COVID-19 pandemic. The tweets have been collected by
an on-going project deployed at https://live.rlamsal.com.np. The
model monitors the real-time Twitter feed for coronavirus-related
tweets using 90+ different keywords and hashtags that are commonly
used while referencing the pandemic.

Data provided in the first dataset were already preprocessed and
ready to be analysed. For the second dataset, a preprocessing step
has been performed: retweets were removed from the collected data,
together with punctuation, stop words, and non-printable characters
such as emojis. Furthermore, various forms of the same word (e.g.,
travels, traveling, and travel's) were lemmatized by converting them
to the main word (e.g., travel) using the WordNetLemmatizer module
of the Natural Language Toolkit Python library.

IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada

Figure 1: Tweets Frequency.


4 THE METHOD 


4.1 Features modeling 


The proposed method is articulated as a sequence of steps. As a
first step, a set of space-time features is extracted from social
data so as to carry out two parallel temporal analyses: (i) the
evolution of tweets, of the users who posted such tweets, and of the
locations from where the tweets have been posted; (ii) the evolution
of tweet content through a textual analysis in which we extract the
most relevant and frequent keywords in the text.

Accordingly, given a space-time window (e.g., a geographic region R
and a timestamp t), the features are as follows. The number of
tweets posted from R at time t is denoted as:

$$x_{TW_t} = |TW_t| \qquad (1)$$

The number of users of R at t represents the distinct users who
have posted at least one tweet in the region R at time t:

$$x_{U_t} = |U_t| \qquad (2)$$

The number of locations within R at t represents the number of
distinct locations from where users have posted at least one tweet:

$$x_{L_t} = |L_t| \qquad (3)$$


The tweet entropy of R at t describes the distribution of tweets
across the users, telling whether users tend to tweet regularly in R
at time t. For this purpose we use the Shannon entropy:

$$x_{H_t} = -\sum_{u=1}^{|U_t|} f(t, u) \log f(t, u) \qquad (4)$$

where f(t, u) is the user's proportion of tweets posted in R at time
t, defined as $f(t, u) = |TW_{t,u}| / |TW_t|$; $TW_{t,u}$ is the set
of tweets posted in R at time t by user u, while $TW_t$ is the set
of tweets posted in R at time t.
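The four numeric features of Eqs. (1)-(4) can be sketched in a few lines. The input layout (one `(user, location)` pair per tweet of the window) and the function name are assumptions made for this illustration; the natural logarithm is also an assumption, since the base is not stated in the text.

```python
import math
from collections import Counter


def window_features(tweets):
    """Compute the four numeric features for the tweets of one space-time window.

    `tweets` is a list of (user, location) pairs, one per tweet posted in
    region R at time t (a hypothetical input layout chosen for this sketch).
    """
    n_tweets = len(tweets)                            # Eq. (1): |TW_t|
    per_user = Counter(user for user, _ in tweets)    # |TW_{t,u}| for each user u
    n_users = len(per_user)                           # Eq. (2): |U_t|
    n_locations = len({loc for _, loc in tweets})     # Eq. (3): |L_t|
    # Eq. (4): Shannon entropy of the per-user shares f(t, u) = |TW_{t,u}| / |TW_t|
    entropy = -sum((c / n_tweets) * math.log(c / n_tweets)
                   for c in per_user.values())
    return n_tweets, n_users, n_locations, entropy


example = [("a", "NY"), ("a", "NY"), ("b", "LA"), ("c", "NY")]
print(window_features(example))
```

The entropy is 0 when a single user posts every tweet of the window and grows as tweets are spread more evenly over more users, which is why it reacts to concurrent changes in tweet and user counts.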
The textual content of the tweets is represented by a set of
features capturing the key elements of the text. Accordingly,
textual features such as hashtags and words are extracted from the
tweets. To this end, a preliminary step is to extract and then
preprocess the textual content of the tweets. The preprocessing of
the tweets consists of the following steps:

1. Stemming and lemmatization. Different tokens might carry similar
information (e.g., tokenization and tokenizing). We can avoid
computing similar information repeatedly by reducing all tokens to
their base forms using stemming and lemmatization dictionaries.

2. Removing stop words and punctuation. Some tokens are less
important than others. For instance, common words such as "the"
might not be very helpful for revealing the essential
characteristics of a text. It is therefore usually a good idea to
eliminate stop words and punctuation marks before doing further
analysis.

3. Computing term frequencies or tf-idf. For document clustering,
one of the most common ways to generate features for a document is
to calculate the term frequencies of all its tokens. Although not
perfect, these frequencies can usually provide some clues about the
topic of the document. Sometimes it is also useful to weight the
term frequencies by the inverse document frequencies.
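The three steps above can be sketched with the standard library only. The stop-word list and the lemma map below are tiny toy stand-ins (assumptions for illustration) for the full dictionaries a real pipeline would use, and the weighting is the plain tf * log(N/df) variant rather than any specific scheme from the paper.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}  # toy list
LEMMAS = {"travels": "travel", "traveling": "travel",
          "tokenizing": "tokenize"}                                    # toy map


def preprocess(text):
    """Lowercase, drop punctuation and stop words, map tokens to a base form."""
    tokens = re.findall(r"[a-z0-9#]+", text.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [preprocess(doc) for doc in documents]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights


print(preprocess("The virus is traveling!"))
print(tf_idf(["the virus travels", "the virus mask"]))
```

Note that the inverse document frequency depends on the whole collection (`n_docs` and `df`), so every new document changes the weights of the old ones; this is the recalibration cost that makes tf-idf awkward on streams.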


After the textual data preprocessing, the entropy values of the
textual features are extracted and used as features for the textual
analysis. In the rest of the paper we use the terms keywords and
textual features interchangeably.

Since the textual features can be very numerous, the most relevant
ones are selected as follows: among the most popular textual
features (the ones with higher support), those with higher entropy
are selected.

The support of a textual feature tf posted in a geographic area R
at time t is the number of tweets containing tf posted in R at time
t:

$$s = |TW_{t,tf}| \qquad (5)$$


The entropy of a textual feature tf posted from region R at
timestamp t describes the distribution of tweets containing tf
across the users, telling whether users tend to use that textual
feature regularly in R at time t. For this purpose, the designed
method uses the Shannon entropy:

$$H_{tf_t} = -\sum_{u=1}^{|U_t|} f(t, u) \log f(t, u) \qquad (6)$$

where f(t, u) is the user's proportion of tweets containing tf
posted in R at time t, and is defined as
$f(t, u) = |TW_{t,u,tf}| / |TW_{t,tf}|$. Here $TW_{t,u,tf}$ is the
set of tweets containing tf posted in R at time t by user u, while
$|TW_{t,tf}|$ is the support of the hashtag, that is, the number of
tweets containing tf.

Accordingly, for each keyword in the tweets posted in R at time t
with support higher than a given threshold, we build the time
series of its entropy feature.
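The selection of textual features by support and entropy can be sketched as follows. The input layout (one `(user, set_of_keywords)` pair per tweet), the function names, the use of an absolute count as support threshold, and the `top_n` cut-off are all assumptions of this illustration, not details given in the paper.

```python
import math
from collections import Counter


def keyword_stats(tweets, keyword):
    """Support (Eq. 5) and user entropy (Eq. 6) of `keyword` in one window.

    `tweets` is a list of (user, set_of_keywords) pairs (hypothetical layout).
    """
    per_user = Counter(user for user, words in tweets if keyword in words)
    support = sum(per_user.values())          # number of tweets containing keyword
    entropy = -sum((c / support) * math.log(c / support)
                   for c in per_user.values())
    return support, entropy


def select_keywords(tweets, min_support, top_n):
    """Keep keywords with support above `min_support`, ranked by decreasing entropy."""
    vocabulary = set().union(*(words for _, words in tweets))
    stats = {kw: keyword_stats(tweets, kw) for kw in vocabulary}
    frequent = [kw for kw, (support, _) in stats.items() if support > min_support]
    # Ties on entropy are broken alphabetically to keep the output deterministic.
    return sorted(frequent, key=lambda kw: (-stats[kw][1], kw))[:top_n]


window = [("a", {"mask", "covid"}), ("b", {"mask"}), ("a", {"covid"}),
          ("c", {"mask"}), ("a", {"covid"})]
print(select_keywords(window, min_support=2, top_n=2))
```

In the example both keywords appear in three tweets, but "mask" is used by three different users while "covid" comes from a single user, so "mask" has the higher entropy and is ranked first.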


4.2 Topics extraction 


This section describes the approach proposed to identify topics
from the tweets by exploiting the features introduced earlier. Given
a space-time window, the algorithm provides a snapshot of what is
going on in the specified geographic location at the given
timestamp. Both space and time granularity may vary from coarse to
fine. For example, the space granularity can be set to a state, a
city or the municipal districts of the city, to surroundings or
specific areas, or even to precise venues (e.g., public buildings,
private houses, etc.). In the same way, the temporal coordinate
ranges from months to weeks, days, hours, and minutes. The proposed
method is articulated as a sequence of steps. As a first step, we
extract the set of space-time features as described in the previous
section.

As a second step, a time series is built for each feature. Then, an
ad-hoc peak detection algorithm analyzes the time series and uses a
score function to find peaks. The identification of peaks is merely
the first step in the process of topic detection. Content analysis
of the tweets where the peaks occur is necessary to identify the
specific topic. To this aim, we also model the most significant
textual features as time series. After that, a clustering algorithm
groups the textual peaks to identify trending topics of discussion
related to the COVID-19 outbreak.

The key steps of the algorithm are peak identification and topic
detection through clustering of textual peaks. As a basic principle,
each element in the time series that deviates from a baseline
profile is considered a peak. The baseline profile models normal
user behavior; it is a function of time and is computed through a
scoring mechanism based on a statistical measure.

Let S be a function which associates a score $S(x_t, T_i)$ to the
element $x_t$ of a given time series $T_i$. A given point $x_t$ in
$T_i$ is a peak if $S(x_t, T_i) > 0$. A data point in a time series
is a local peak if (a) it is a large and maximum value within a time
window (the value need not be the global maximum of the entire time
series); and (b) it is isolated, i.e., not too many points in the
window have similar values. For the purpose of the experimental
evaluation presented in this paper, the time window is 6 hours and
the time series spans 4 days; thus it consists of 16 timestamps. As
we are interested in identifying topics in nearly real time, we
consider only the previous temporal values in the time series.
Clearly, the approach also applies to retrospective analysis; in
this case, the subsequent values in the time series can be
considered as well.
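A 4-day series with 6-hour windows (4 windows per day, hence 16 timestamps) can be built by bucketing tweet timestamps. The sketch below counts tweets per window; the function name and its parameters are hypothetical, and the same bucketing would apply to any of the other features.

```python
from collections import Counter
from datetime import datetime, timedelta


def tweet_count_series(timestamps, start, window_hours=6, n_windows=16):
    """Count tweets per window for `n_windows` consecutive windows from `start`.

    Timestamps falling before `start` or after the last window are ignored.
    """
    width = timedelta(hours=window_hours)
    # Integer division of two timedeltas gives the index of the window.
    counts = Counter((ts - start) // width for ts in timestamps)
    return [counts[i] for i in range(n_windows)]


start = datetime(2020, 3, 5)
posted = [datetime(2020, 3, 5, 1), datetime(2020, 3, 5, 7),
          datetime(2020, 3, 5, 11, 59), datetime(2020, 3, 8, 23)]
series = tweet_count_series(posted, start)
print(len(series), series[0], series[1], series[15])
```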

For a given point $x_t$ in $T_i$, the score function computes the
distances of $x_t$ from its k left neighbors. Such distances can be
computed by using a proper statistical measure. According to most of
the literature, the most suitable statistics for detecting peaks in
time series are the median and the percentile. Accordingly, the
median or percentile of the k preceding temporal values in $T_i$ has
to be computed.

To identify peaks, the algorithm builds a scoring array S to
compute the deviation of each element of the time series from the
baseline profile obtained using the statistic s. The score is
computed using the following score function:

$$S(x_t, T_i) = \frac{x_t - x^{*}_{baseline}}{x_t} \qquad (7)$$

where $x^{*}_{baseline}$ is the baseline value for the time series
$T_i$ at time t, obtained by applying the statistic s to the k
points immediately preceding t in $T_i$. Notice that each time
series $T_i$ covers the time span from t − k to t.
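Peak detection against a baseline computed from the k preceding values can be sketched as below. The sketch assumes a relative-deviation score of the form (x_t - baseline) / x_t with a peak declared when the score is positive, which is one reading of Eq. (7); the function names and the nearest-rank percentile are choices made for this illustration.

```python
import math
from statistics import median


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def detect_peaks(series, k, baseline=median):
    """Return indices t whose score (x_t - x_baseline) / x_t is positive.

    The baseline is `baseline` applied to the k values immediately preceding t,
    so only past values are used (near real-time setting).
    """
    peaks = []
    for t in range(k, len(series)):
        x_t = series[t]
        if x_t == 0:
            continue  # score undefined for a zero value; treated as no peak
        x_base = baseline(series[t - k:t])
        if (x_t - x_base) / x_t > 0:
            peaks.append(t)
    return peaks


series = [10, 12, 11, 10, 30, 11, 12, 40]
print(detect_peaks(series, k=4))                                       # median baseline
print(detect_peaks(series, k=4, baseline=lambda v: percentile(v, 90)))  # 90th percentile
```

On this toy series the median baseline flags indices 4, 6 and 7, while the 90th-percentile baseline flags only 4 and 7: the small bump at index 6 exceeds the median of its neighbors but not their upper tail, illustrating that a percentile baseline is the more selective of the two.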

After peaks are identified, if the set of hashtags and words with
textual peaks is not empty, the method activates the clustering
algorithm that is responsible for the topic detection.

To identify different topics emerging simultaneously in the target
space-time window, we propose a clustering algorithm that exploits
the co-occurrences of the keywords in the tweets. In particular,
keywords exhibiting peaks within the timestamp t are grouped if they
co-occur in a number of tweets determined by a threshold value δ.
The clustering algorithm uniquely identifies topics: each cluster
c ∈ C is associated with a topic, and all the keywords in the
cluster characterize the topic.
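A minimal sketch of co-occurrence clustering is given below. The text does not spell out how co-occurrence is normalized or how groups are formed, so two assumptions are made here and should be read as such: the co-occurrence ratio is the number of shared tweets divided by the number of tweets of the rarer keyword, and clusters are the connected components of the resulting keyword graph (computed with a small union-find).

```python
from itertools import combinations


def cluster_keywords(peak_keywords, tweets, delta):
    """Group peaking keywords whose co-occurrence ratio is at least `delta`.

    `tweets` is a list of keyword sets, one per tweet in the window
    (hypothetical layout). Returns a sorted list of sorted keyword lists.
    """
    containing = {kw: {i for i, words in enumerate(tweets) if kw in words}
                  for kw in peak_keywords}
    parent = {kw: kw for kw in peak_keywords}

    def find(kw):
        # Follow parent links to the representative, halving the path on the way.
        while parent[kw] != kw:
            parent[kw] = parent[parent[kw]]
            kw = parent[kw]
        return kw

    for a, b in combinations(peak_keywords, 2):
        smaller = min(len(containing[a]), len(containing[b]))
        if smaller and len(containing[a] & containing[b]) / smaller >= delta:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for kw in peak_keywords:
        clusters.setdefault(find(kw), set()).add(kw)
    return sorted(sorted(group) for group in clusters.values())


tweets = [{"mask", "shortage"}, {"mask", "shortage"}, {"mask"},
          {"flight", "travel"}, {"flight", "travel"}, {"death"}]
peaks = ["mask", "shortage", "flight", "travel", "death"]
print(cluster_keywords(peaks, tweets, delta=0.6))
```

With a low threshold, weakly related keywords are chained into large clusters; with a high one, keywords of the same topic fail to merge and the topic is split, which matches the trade-off discussed in the evaluation of the clustering parameters.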


5 EXPERIMENTAL RESULTS 


This section presents the preliminary results of the experimental
evaluation performed on the dataset of geo-tagged tweets collected
in the US and presented in Section 3. The aim of the evaluation is
to assess the effectiveness of the proposed topic detection approach
and to identify the main COVID-19 related topics of discussion in
the US in the period spanning from March 5 to June 21.


5.1 Performance Evaluation 


The quality of the results obtained with the proposed method
depends on the parameters of the two main phases of the approach:
peak identification and clustering of the textual peaks. The key
parameters of the peak detection step are (i) the statistical
measure used for the baseline values and (ii) the set of features
used to identify peaks. As for the clustering, the parameters are
(i) the support of the textual features and (ii) the co-clustering
threshold. The rest of the section evaluates the performance of the
method with respect to these parameters.

To assess the performance of the peak detection step, the
effectiveness of the different features has been evaluated,
considering both median and percentile. The evaluation has been
performed in terms of the following metrics:

• Detected topics. It is the percentage of detected topics:

$$DT = \frac{\text{Detected Topics}}{\text{Total Topics}}$$

• Topic Rate. It is the percentage of peaks of numeric features
that become topics:

$$TR = \frac{\text{Detected Topics}}{\text{Total Peaks}}$$

• Distinct Topic Rate. It is the percentage of unique topics out
of the total number of topics detected:

$$DTR = \frac{\text{Unique Topics detected}}{\text{Detected Topics}}$$
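The three ratios above translate directly into code. The helper below is a hypothetical illustration; its name, argument names, and the counts used in the example are invented and are not results from the paper.

```python
def evaluation_metrics(total_topics, total_peaks, detected_topics, unique_topics):
    """Return (DT, TR, DTR) as percentages; all inputs are plain counts."""
    dt = 100.0 * detected_topics / total_topics     # share of topics detected
    tr = 100.0 * detected_topics / total_peaks      # share of peaks that become topics
    dtr = 100.0 * unique_topics / detected_topics   # share of detections that are unique
    return dt, tr, dtr


# Invented counts, for illustration only.
dt, tr, dtr = evaluation_metrics(total_topics=20, total_peaks=18,
                                 detected_topics=16, unique_topics=12)
print(f"DT={dt:.0f}%  TR={tr:.0f}%  DTR={dtr:.0f}%")
```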


Results are shown in Figure 2. Several peaks of varying intensity 
and frequency have been identified for both numeric and textual 
features. However, not all the peaks correspond to a topic as the 
proposed approach identifies a topic through the clustering algo- 
rithm when both numeric and textual peaks are detected. 

In general, the number of tweets (TW) is the feature producing the
highest number of peaks; however, many of these peaks do not
correspond to any topic, and therefore its topic rate is the lowest.
The number of users (U) also produces a relevant number of peaks,
although fewer than the number of tweets; unlike the latter, many of
them are associated with trending topics, giving a good topic rate
and a reasonably good percentage of detected topics. Using the
number of locations (L) as feature, the peaks obtained are rather
small, but almost all of them are actually associated with a topic,
with a consequently high topic rate. However, by using the locations
only a small percentage of topics is detected. The last feature
considered is the entropy of tweets (H). With this feature the peaks
detected are substantially lower than the ones identified with the
number of tweets and, in contrast to that feature, almost all the
peaks are associated with topics, yielding the highest topic rate
and percentage of detected topics. The entropy turned out to be the
most discriminative feature: significant and concurrent changes in
the number of tweets and in the number of users, as modeled by the
entropy, are a sign of interesting and really popular topics.

As for the performance achieved by the different statistics used
to obtain the baseline threshold, the percentile is the most
selective measure and, for this reason, it does not intercept low
peaks (e.g., the ones associated with topics that were quite popular
across temporal windows and thus exhibit a number of peaks of
similar intensity within a close temporal horizon). On the contrary,
the median is able to catch even low-intensity peaks, and for this
reason some of them do not actually correspond to any topic.
Therefore, even if the percentage of detected topics is higher, the
topic rate is lower than the one achieved when using the percentile.

Results show that the optimal choice is the use of the median as
statistical measure and the entropy of tweets as feature, a setting
that is used throughout the experimental study.

To identify the optimal values of the clustering parameters, Ta- 
ble 2 evaluates the clustering structure obtained with different val- 
ues of the co-clustering threshold and support, in terms of percent- 
age of detected topics, topic rate and distinct topic rate. 


Support     Co-clustering threshold    DT(%)    TR(%)    DTR(%)

s = 0.25    δ = 0.2                    25%      8%       7%
s = 0.25    δ = 0.4                    40%      12%      15%
s = 0.25    δ = 0.6                    63%      25%      28%
s = 0.25    δ = 0.8                    60%      30%      26%
s = 0.50    δ = 0.2                    34%      70%      20%
s = 0.50    δ = 0.4                    58%      75%      43%
s = 0.50    δ = 0.6                    80%      89%      95%
s = 0.50    δ = 0.8                    62%      86%      66%
s = 0.75    δ = 0.2                    43%      72%      30%
s = 0.75    δ = 0.4                    66%      74%      55%
s = 0.75    δ = 0.6                    75%      89%      72%
s = 0.75    δ = 0.8                    72%      82%      73%

Table 2: Evaluation of the clustering structure w.r.t. clustering
parameters.


Table 2 shows that for small values of both support and
co-clustering threshold the algorithm produces big clusters that
group together tweets that are poorly connected and thus discuss
different topics. This results in very poor values of the evaluation
metrics: a low percentage of detected topics and a low topic rate.
As both the support of the textual features and the co-clustering
threshold increase, the algorithm groups the tweets better and its
performance gradually improves. However, values of the clustering
parameters that are too high deteriorate the performance of the
algorithm. If the support of the textual features is too large,
there is the risk of not considering smaller textual peaks
corresponding to relevant topics, and consequently the detection
rate decreases. This could be, for example, the case of a topic
described by several textual features that exhibit smaller peaks. On
the other hand, for large values of the co-clustering threshold the
algorithm does not group related tweets and creates a high number of
small clusters with few tweets. Therefore, several clusters are
created for the same topic. This results in a high percentage of
clusters that incorrectly become topics, with a high percentage of
duplicated topics and, thus, a small distinct topic rate. In
summary, the optimal parameter setting turned out to be s = 0.50,
δ = 0.6.

Figure 2: Performance of the different features w.r.t. statistic metrics.


5.2 COVID-19 related topics in US 


In this section, we take a look at the content of the conversation
taking place on Twitter about COVID-19 in the US.

Figure 3 shows the most popular textual features across all US
states. At the beginning of the epidemic the most used words were
corona, outbreak, spread, home. Over time, the words used in the
tweets change slightly: for example, on March 8 the most used were
pandemic, deadth, test, positive; on March 19 new hashtags and words
appeared, like #realdonaldtrump, week, stops, president; on March
23, positive, tested, patient, test, relief. Going further in time
we find death, even, test, health, positive, cases.

Figure 3: Overall textual peaks in US.

The most affected US states up to June 21 were California with
111,162 cases, Florida with 63,115 cases, New York with 48,822
cases, Texas with 107,232 cases, and Washington with 50,311 cases.
Figures 4, 5, and 6 show the textual features in California, Florida
and New York state, respectively. As can be noted, there are several
peaks, underlining that the popularity of the textual features
varies with time. In particular, Figures 4-6 show that the states of
California, Florida and New York present the same highest peaks
along the same or very close temporal intervals, and such peaks are
recurrent throughout the period. This is the case of textual peaks
like die, testing center, death, mask, vaccine, spread.

Figure 4: Overall textual peaks in California.

Figure 5: Overall textual peaks in Florida.

Figure 6: Overall textual peaks in NY.

As an example, the situation in California is examined in more
detail. Figure 4 shows that in California there are several textual
peaks throughout the period. Peaks present different intensity and
duration. Some of the textual features exhibit relevant peaks only
the first time they appear, like corona outbreak, emergency, food,
remote working, toilet papers, milk; other textual features have
recurrent peaks along the observed weeks, like death, die, spread,
fake, cases, symptoms, vaccine, quarantine, breathing. In this last
case there are few relevant peaks and the intensity of the textual
features remains rather constant overall. Other features like mask
or testing center present isolated peaks, meaning that they become
very popular on specific days.

Table 3 shows the top 12 topics detected in the US in the target
period, using as numeric feature the entropy of tweets and the
entropy of the textual features; the median is used to obtain the
baseline values for extracting peaks from the time series.

The table highlights that one of the most popular topics is related
to the spread of the epidemic, with discussions about corona
outbreak, deaths, COVID-19 cases, quarantine, breathing illness,
emergency, mask, vaccines. Another relevant topic throughout the
period is the topic Death, related to the deaths caused by COVID-19
in the different US states. The topic Travel issues is related to
the effects of COVID-19 on travel. Discussions were about flight
cancellations, postponements, and restrictions, as well as travel
warnings imposed by many countries due to the pandemic. The topic
Panic buy groups the tweets discussing how people started to buy
food and supplies due to lockdowns and stay-at-home orders during
the COVID-19 pandemic, and how supermarkets and shops controlled and
prevented panic buying. Another topic identified was the one related
to racism. Specifically, users in most of the tweets reported the
spread of racist and xenophobic attacks, and discussed the
disproportionate toll the COVID-19 pandemic was taking on
communities of color. In the wake of the killing of George Floyd,
the activism has intensified: doctors, epidemiologists, and nurses
are increasingly abandoning their characteristic reticence in favor
of direct advocacy. The topic Medical supplies concerns tweets about
the importance of facial masks and gloves as prevention measures to
reduce the outbreak, and also their shortage in several countries.
Tweets about quarantining people infected or suspected to have
COVID-19 are grouped in the topic Quarantine. At the same time, with
the epidemic spreading, the reactions of people [...]

Social networks have become a popular tool for information
dissemination and individual opinion-making. Analyzing social
network discussions can give us a perception of society and the
world. In the current situation caused by COVID-19, understanding
people's awareness is extremely important. In this paper, [...] and
the entropy of tweets, the number and the movements of distinct
users; and (ii) textual features, i.e., the entropy of relevant
hashtags. Each space-time feature is then modeled as a time series.
Peaks of textual features co-occurring in a significant number of
tweets are then associated with the same topic through a clustering
algorithm. Results obtained over real-world datasets of tweets show
the feasibility of the proposed approach, which is able to detect a
large number of relevant COVID-19 topics.

ABSTRACT


This year marks the silver anniversary of IDEAS. It has
been an exciting quarter century to shepherd this meeting
through good times and not so good ones. We have survived
Ebola, MERS and SARS. Whereas the others were local, the
Covid pandemic, which still rages, has forced us to move
to an on-line version, but thanks to the participants and
the dedicated program committee we have continued. This
paper is a photographic journey through the years of IDEAS.
Unfortunately, we have not been able to include images
of all participants over the quarter century of IDEAS; it
offers just a sampling of some of the fond moments during
the social gatherings of the IDEAS family.


CCS CONCEPTS

ing; • Social and professional topics; • Applied computing;
• Security and privacy;

1 INTRODUCTION


Over the last 25 years, IDEAS has visited many cities, countries
and three continents. We were able to hold IDEAS in Hong Kong
in spite of SARS; luckily it was under control by the time of
the meeting. Unlike the other epidemics, the current COVID
pandemic is taking its toll on people's lives. We were forced
to move IDEAS 2020, which was to be hosted in S. Korea, to an
on-line version. We were hopeful that the incidence of
epidemics, like EBOLA, MERS and SARS, would be short lived. We
were hoping that, like the previous epidemics, COVID would be
limited and there would not be further disruptions after the
summer of 2020! Alas, the pandemic has its own agenda and the
first wave was followed by a more deadly second wave and
rapidly by a third. The organizing committees, seeing this and
the slow progress in the vaccination efforts, reluctantly
decided to move our silver jubilee event on-line as well. In
spite of this last-minute change, we were glad to receive a
large number of submissions. The current proceedings are
evidence of the high quality of work in databases and data
analytics.


1.1 Our meetings 


Following is the list of the IDEAS meetings over the past 
quarter of a century. 


• IDEAS 1997: International Database Engineering and
Applications Symposium, Aug. 25-27, 1997, Concordia
University, Montreal, QC, Canada, Eds. Bipin C. Desai,
Barry Eaglestone: IEEE Computer Society 1997;
ISBN: 0-8186-8114-4;
DOI: https://doi.org/10.1109/IDEAS.1997

• IDEAS 1998: 2nd International Database Engineering
and Applications Symposium, July 8-10, 1998, University
of Cardiff, Eds. Barry Eaglestone, Bipin C. Desai,
Jianhua Shao: IEEE Computer Society 1998;
ISBN: 0-8186-8307-4;
DOI: https://doi.org/10.1109/IDEAS.1998

• IDEAS 1999: 3rd International Database Engineering
and Applications Symposium, August 2-4, 1999, Concordia
University, Montreal, Canada, Eds. Bipin C. Desai,
Gosta Grahne: IEEE Computer Society 1999;
ISBN: 0-7695-0265-2;
DOI: https://doi.org/10.1109/IDEAS.1999

• IDEAS 2000: 4th International Database Engineering
and Applications Symposium, IDEAS 2000, September
18-20, 2000, Keio University, Yokohama, Japan, Eds.
Bipin C. Desai, Yasushi Kiyoki, Motomichi Toyama: IEEE
Computer Society 2000;
ISBN: 0-7695-0789-1;
DOI: https://doi.org/10.1109/IDEAS.2000

• IDEAS 2001: 5th International Database Engineering
and Applications Symposium, July 16-18, 2001, Grenoble,
France, Eds. Michel E. Adiba, Christine Collet,
Bipin C. Desai: IEEE Computer Society 2001;
ISBN: 0-7695-1140-6;
DOI: https://doi.org/10.1109/IDEAS.2001


• IDEAS 2002: 6th International Database Engineering
and Applications Symposium, July 17-19, 2002,
Edmonton, Canada, General Chair: Bipin C. Desai,
Eds. Mario A. Nascimento, M. Tamer Ozsu, Osmar R.
Zaiane: IEEE Computer Society 2002;
ISBN: 0-7695-1638-6;
DOI: https://doi.org/10.1109/IDEAS.2002

• IDEAS 2003: 7th International Database Engineering
and Applications Symposium, 16-18 July 2003, Hong
Kong, China, Eds. Bipin C. Desai, Wilfred Ng: IEEE
Computer Society 2003;
ISBN: 0-7695-1981-4;
DOI: https://doi.org/10.1109/IDEAS.2003

• IDEAS 2004: 8th International Database Engineering
and Applications Symposium (IDEAS 2004), 7-9 July
2004, Coimbra, Portugal, Eds. Bipin C. Desai, Jorge
Bernardino: IEEE Computer Society 2004;
ISBN: 0-7695-2168-1;
DOI: https://doi.org/10.1109/IDEAS.2004

• IDEAS 2004DH: IDEAS Workshop on Medical Information
Systems: The Digital Hospital (IDEAS-DH'04),
1-3 Sept. 2004, Beijing, China, Eds. Bipin C.
Desai, Keqin Rao, Baoluo Li, Fulin Zhang;
ISBN: 0-7695-2289-0;
DOI: https://doi.org/10.1109/IDEADH.2004


• IDEAS 2005: 9th International Database Engineering
and Applications Symposium (IDEAS 2005), 25-27 July
2005, Montreal, Canada, Eds. Bipin C. Desai, Gottfried
Vossen: IEEE Computer Society 2005;
ISBN: 0-7695-2404-4;
DOI: https://doi.org/10.1109/IEEECONF10944.2005

• IDEAS 2006: 10th International Database Engineering
and Applications Symposium (IDEAS 2006), 11-14 December
2006, Delhi, India, Eds. Bipin C. Desai, Shyam
K. Gupta: IEEE Computer Society 2006;
ISBN: 0-7695-2577-6;
DOI: https://doi.org/10.1109/IEEECONF11732.2006

• IDEAS 2007: 11th International Database Engineering
and Applications Symposium (IDEAS 2007), September
6-8, 2007, Banff, Alberta, Canada, Eds. Bipin C.
Desai, Ken Barker: IEEE Computer Society 2007;
ISBN: 0-7695-2947-X;
DOI: https://doi.org/10.1109/IEEECONF13176.2007

• IDEAS 2008: 12th International Database Engineering
and Applications Symposium (IDEAS 2008), September
10-12, 2008, Coimbra, Portugal, General Chair:
Bipin C. Desai, ACM International Conference Proceeding
Series 299, ACM 2008;
ISBN: 978-1-60558-188-0;
DOI: https://doi.org/10.1145/1451940
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2 A PICTORIAL LOOK BACK 


Over this quarter century, IDEAS has had many participants;
some were graduate students or in the early phase of
their careers; others were more established. Many of these
contributors have continued to participate in these annual
meetings. Others are either retired or making plans! The
baton has passed from these to the new ones who keep joining
this IDEAS family. The following photos hold some fond
memories to be shared!


ABSTRACT


Modern application development requires efficient management
of huge RDF data. The major approaches to RDF data management
are relational and graph-based techniques. As the relational
approach suffers from query joins, we propose a semantic aware
graph-based partitioning method. The partitioned fragments are
then allocated in a load-balanced way. For efficient query
processing, partial replication is implemented. It reduces
inter-node communication, thereby accelerating queries on the
distributed RDF graph. This approach is demonstrated in two
phases, partitioning and distribution, on Linked Observation
Data (LOD). The time complexity of partitioning and distribution
in Load Balanced Semantic Aware RDF Graph (LBSD) is O(n), where
n is the number of triples, which is demonstrated by a linear
increase in algorithm execution time (AET) for LOD data scaled
from 1x to 5x. LBSD has been found to behave well up to 4x.
LBSD is compared with state-of-the-art relational and
graph-based partitioning techniques. LBSD records a 71% gain in
query execution time (QET) when averaged over all four query
types. For the most frequent query types, linear and star, an
average QET gain of 65% over the original configuration is
recorded in the scaling experiments. The optimal replication
level has been found to be 12% of the original data.


1 INTRODUCTION


A W3C standard, the Resource Description Framework (RDF) is a
foundation of the semantic web and is used to model web objects. An RDF
dataset comprises triples of the form (subject, property,
object). It can be readily understood as a graph, where subjects
and objects are vertices joined by labeled relationships, i.e., edges.
RDF is, however, now being used in a broader context. The Bio2RDF [4]
data collection is used by biologists to store their experimental
results as RDF triples to support structural queries and to
communicate among themselves. Similarly, DBpedia [5] extracts information
from Wikipedia and stores it as RDF data. W3C offers a structured
query language, SPARQL, to retrieve and manage RDF datasets.
Answering a SPARQL query requires finding a match of the query
graph in the entire RDF graph. As RDF gains wide acceptance and
datasets grow, RDF data is moving from centralized systems to
distributed systems.

There are two techniques for RDF data management: relational 
and graph-based. In the relational method, data is kept in the form 
of multiple tables. To find an answer to a query, one needs to 
extract that information from various tables by applying the join 
operation. On the other hand, in the graph-based technique data is 
represented in the form of vertices and edges. Semantic partitioning
[23] is one such graph partitioning technique, implemented for a
centralized system using a PageRank-style algorithm. Many
state-of-the-art works aim at building efficient partitioning and
distribution algorithms.

Some of the partitioning algorithms use the query workload 
to identify the parts of the RDF graph which are used frequently 
and keep these subgraphs at one site. While this approach works 
well for the systems in which the majority of queries follow the 
identified query patterns, it may not work as well in the systems 
where new queries do not correlate with the existing workload. 
A system configuration that does not use workload information is
therefore desirable. If we instead use the semantics of RDF to
partition the data, algorithm execution time would be much lower and
query execution time for new queries would be the same as or better
than with workload-aware methods. Here, the semanticity of RDF data
refers to the format of triples in a Turtle or N-Quads RDF file. This
triple data file can be used directly for partitioning and
distribution, using the fact that an edge is denoted by the
equivalence of the subject and object in two triples. Using this
structure of triples, one can work directly with complexities that
are based on the number of triples in a file.
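The subject/object equivalence mentioned above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the toy triples and the function name `linked_triples` are hypothetical and do not come from the LOD dataset or from the LBSD implementation.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not taken from LOD.
TRIPLES = [
    ("sensor1", "locatedIn", "stationA"),
    ("stationA", "partOf", "regionX"),
    ("sensor1", "measures", "windSpeed"),
]


def linked_triples(triples):
    """Return pairs (t1, t2) where the object of t1 equals the subject of t2."""
    # Index triples by subject in one pass over the file.
    by_subject = defaultdict(list)
    for triple in triples:
        by_subject[triple[0]].append(triple)

    links = []
    for t1 in triples:
        # Every triple whose subject is t1's object continues the path from t1.
        for t2 in by_subject.get(t1[2], []):
            links.append((t1, t2))
    return links
```

Because the index is built in a single pass, the work grows with the number of triples plus the number of links found, which is in line with reasoning about cost in terms of the triple count.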

Considering these aspects of graph-based techniques, this research
develops algorithms to partition data using the semantic relation
between vertices and to distribute it among several nodes. Load
Balanced Semantic Aware Graph (LBSD) uses semantic partitioning for
the initial phase of partitioning. The system partitions the data
into clusters. It then distributes related clusters (by semantic
connection) among the given number of nodes. The fundamental reason
to partition RDF data is to answer queries efficiently in less
time. To reduce inter-node communication (INC) in a distributed
environment, partial replication [13] of data is used. This involves
deciding how much data should be replicated on every node to
reduce INC.

The rest of the paper is organized as follows: in the next section, 
we discuss related work regarding this research. In Section 3, we 
discuss the methodology used to implement this work. Section 4 
describes the details of experiments and evaluation parameters. In 
Section 5, we discuss the results and comparison of the system 
with the state of the art work, and then finally Section 6 states the 
conclusion. 


2 RELATED WORK 


Present approaches for handling huge RDF data can be classified
into two categories: relational and graph-based approaches.


2.1 Relational Approaches 


RDF triples can naturally be stored as a single table with
three columns, namely subject, predicate and object. This table can
have millions of triples. This approach aims to utilize the
well-developed techniques of conventional relational systems for
query processing and storage of data. Research in relational
techniques deals with partitioning RDF tables in such a way that
there is a substantial decrease in the number of joins needed to
answer a query.
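To make the join cost concrete, the following minimal sketch stores triples in a single three-column table and answers a two-hop (linear) query with one self-join. It uses SQLite from the Python standard library purely for illustration; the table name, helper names and sample data are assumptions, not part of any system discussed here.

```python
import sqlite3


def build_triple_table(triples):
    """Create an in-memory triple table with subject, predicate, object columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


def two_hop(conn, start, first_predicate, second_predicate):
    """Follow two predicates from `start`; each extra hop costs one more self-join."""
    sql = """
        SELECT t2.object
        FROM triples AS t1
        JOIN triples AS t2 ON t1.object = t2.subject
        WHERE t1.subject = ? AND t1.predicate = ? AND t2.predicate = ?
    """
    rows = conn.execute(sql, (start, first_predicate, second_predicate))
    return [row[0] for row in rows]
```

A path of length h needs h - 1 such self-joins on the same large table, which is the cost that partitioning schemes for the relational approach try to reduce.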

The property tables approach exploits repeated occurrences of
patterns and stores correlated properties in the same table. Class
property tables and clustered property tables are two such
techniques: the former defines tables that contain a particular
property value, while the latter defines a table for a particular
subject [1].

DWAHP [19] is a relational technique that partitions the data with
a workload-aware approach using an n-hop property reachability
matrix. Clustering of relational data in distributed databases for
medical information is discussed in [21], a similar state-of-the-art
work for relational systems that uses horizontal fragmentation.
That technique is implemented for the relational approach, whereas
LBSD addresses the same problem for the graph-based approach.

A relational approach to SPARQL queries is direct relational
mapping, in which a SPARQL query is translated to an SQL query over
data stored as triples [24]. Another technique is single-table
extensive indexing, which is used to develop native storage systems
that allow extensive indexing of the triple table, e.g. Hexastore
and RDF-3X [18]. SIVP [22] proposes Structure Indexed Vertical
Partitioning, which combines structure indexing and vertical
partitioning to store and query RDF data for interactive semantic
web applications. It presents five metrics to measure and analyze
the performance of the SIVP store. SIVP is better than vertical
partitioning provided the extra time needed in SIVP, which consists
of lookup time and merge time, is compensated by frequency. All of
the above are relational approaches that relate to LBSD in one way
or another.
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2.2 Graph Based Approaches 


The graph-based technique eliminates query joins. It maintains the
original representation of the RDF data and implements the semantics
of RDF queries, but it scales poorly. Several recent works deal
with RDF graph partitioning. gStore [24] is a system designed to
exploit the natural graph structure of RDF triples. It executes
queries using a subgraph matching approach. The graph-based
technique Adaptive Partitioning and Replication (APR) [2]
partitions query graphs using workload information and then uses a
benefit level to decide how much data should be replicated in the
given graph.

Another approach is UniAdapt [3]. This technique proposes a
unified optimization approach that enables a distributed RDF triple
store to adapt its RDF storage layer by focusing on replication as
well as main memory indexes. The final objective of this approach
is to decrease future query execution time. METIS [14] is one of the
popular baselines for multiple works [10] [9] [7]. APR [2] first
partitions the graph using METIS and then uses a global query graph
built from the workload for replication. To handle query processing
efficiently in a distributed environment, [12] proposes an edge-cut
approach using two strategies: (1) an overpartitioned minimal
edge-cut cover and (2) a novel molecule hash cover. Its analysis
substantiates the hypothesis by explaining the causes of their good
performance. Both strategies reduce query execution time on its set
of test queries (by between 5% and 98%).

Another approach uses the semantic properties of RDF data
and proposes a PageRank-inspired algorithm to cluster the RDF
data [23]. This approach is implemented for a centralized system,
whereas the proposed technique, LBSD, is inspired by it but works
for distributed systems. One more recent approach [20] uses the
frequency of query patterns to partition the graph and proposes
three methods of fragmentation. Besides the relational and
graph-based approaches, there are approaches which deal with
indexes, dataset formats and storage structure. While partitioning
and distributing data, the index of the data fed to the system and
the format of the data are also key features. Several partitioning
techniques handle the query workload with static partitioning, with
the result that 40% of queries remain unanswered [11]. These
shortcomings are resolved in [8], which handles dynamic range
partitioning using workload information.

Present work in graph-based RDF data management needs workload
information for partitioning and distribution and does not
address the load balancing issue. Present research in semantic
aware partitioning is demonstrated on centralized systems only.
These limitations motivate this research to implement LBSD to
support semantic aware partitioning in a distributed environment.
LBSD has two phases: (1) semantic aware partitioning and
(2) distribution using partial replication. It aims to reduce the
communication cost during SPARQL query processing. It adaptively
maintains some frequent access patterns (FAPs) to reflect the
characteristics of the workload while ensuring the data integrity
and approximation ratio. To reduce INC, data should be replicated
among all local nodes according to its semantic relation, and for
that a partial replication technique can be used. The partial
replication technique decides the replication level using certain
criteria and replicates the vertices which are most frequently used
or most relevant. LBSD uses a similar technique for the graph-based
approach, using the concept of centrality.


Algorithm 1: Extraction of Popular Nodes
Input: RDF triples, number of fragments k
Output: list of frequent subjects
Function EXTRACTION(RDF triples, k fragments):
    hashmap h with key = subject and value = frequency
    for each triple in triples do
        h(subject)++
    sort the hashmap by value
    return top k subjects


3 RESEARCH METHODOLOGY


LBSD aims to distribute RDF data over the available nodes using a
graph-based approach to reduce inter-node communication (INC).
The methodology is divided into two phases. The first phase is
semantic aware partitioning of RDF data, which consists of two
algorithms: Algorithm 1 extracts popular nodes and Algorithm 2
performs the partitioning. The second phase is distribution of RDF
data, which includes Algorithm 3 and Algorithm 4 for distribution
and replication respectively. Figure 1 depicts these phases. As
shown in Figure 1, the available RDF datasets are first transformed
from CSV to a .ttl data file to serve as input to graph-based
tools. The .ttl data file contains the data as triples, which are
then fragmented and distributed in the subsequent phases.


3.1 Partitioning of RDF Graph 


Our aim in designing a fragmentation algorithm is to reduce INC,
especially for linear and star queries. For example, social media
data may have frequent star queries to get the friends of a person.
RDF data has an advantage because it represents the data in the
form of triples <subject, predicate, object>. First, we need to find
the subjects which have a high out-degree. If we put these
popular subjects at different nodes, then we can avoid INC for
star queries.


Figure 1: Block diagram for LBSD (RDF data: CSV to .ttl data file,
graph-based tools; Load Balanced Semantic aware Graph Database:
semantic aware partitioning, load-balanced distribution, partial
replication)


Each of these popular subjects is known as a master node. INC is
avoided by sending an incoming query to the node on which the
corresponding subject resides. Suppose there are k fragments; then
we need to find the k subjects with the highest number of outgoing
links. Algorithm 1 lists the steps needed to do that. Since
Algorithm 1 goes through all the triples only once, it has linear
time complexity O(n), where n is the number of triples. Sorting the
HashMap takes O(m log m), where m is the number of distinct
subjects. Since the number of triples is typically much greater
than the number of distinct subjects, the sorting cost can safely
be ignored.
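A minimal Python sketch of Algorithm 1 follows, assuming triples are given as (subject, predicate, object) tuples. The function name `extract_popular_subjects` is hypothetical, and the standard-library `Counter` stands in for the HashMap of the pseudocode.

```python
from collections import Counter


def extract_popular_subjects(triples, k):
    """Return the k subjects with the most outgoing edges (the master nodes)."""
    # One pass over the triples: count how often each subject occurs.
    counts = Counter(subject for subject, _predicate, _obj in triples)
    # most_common(k) selects the k largest counts instead of fully sorting.
    return [subject for subject, _count in counts.most_common(k)]
```

The counting pass is O(n) in the number of triples, matching the analysis above; selecting the top k from m distinct subjects costs at most as much as sorting them.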

In Algorithm 2, after getting the most important subjects, we
allocate the triples corresponding to each such subject to a
cluster. We thus obtain k fragments. To allocate the remaining
triples to these fragments, we need to find the degree of closeness
of each triple to the existing fragments. Given a triple t not yet
assigned to any fragment, we find which fragment has the largest
number of triples whose object equals the subject of t. The triple
t and all other triples which share the same subject, which we call
a secondary master node, are then added to that cluster. This is
repeated for the rest of the triples. Here, each triple that is not
part of the initial partitioning has to be compared against the
existing triples of the partitions. To do this efficiently, we
preprocess the partitions and use a HashMap to record the
occurrences of objects in each partition. Since the subject of
every remaining triple is checked in the HashMap of each partition,
the worst-case time complexity of the algorithm becomes O(n*k),
where n is the number of triples and k is the number of fragments
or partitions.


Algorithm 2: Semantic Aware Partitioning
Input: RDF triples, frequent subjects, k fragments
Output: k partitions
Function SAPartition(RDF triples, frequent subjects, k fragments):
    for each fragment i = 1 to k do
        fragment f_i = triples whose subject matches s_i
    make hashmap h_i for each fragment i where key = object
        and value = frequency
    for rest of the triples do
        put triple in the fragment i with max h_i(object)
    return k fragments
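The partitioning step can be sketched in Python as follows. This is a sketch under stated assumptions rather than the LBSD implementation: the name `semantic_partition` is hypothetical, ties (including subjects that occur as an object in no fragment) go to the lowest-numbered fragment, and the per-fragment object counts are updated as triples are added, which the description above does not specify.

```python
from collections import Counter, defaultdict


def semantic_partition(triples, masters):
    """Split triples into len(masters) fragments, one per master subject."""
    k = len(masters)
    fragments = [[] for _ in range(k)]
    # object_counts[i][o] = number of triples in fragment i whose object is o.
    object_counts = [Counter() for _ in range(k)]
    master_index = {subject: i for i, subject in enumerate(masters)}

    # Initial partitioning: triples of a master subject seed its fragment.
    remaining = defaultdict(list)  # subject -> triples not yet assigned
    for triple in triples:
        subject, _predicate, obj = triple
        i = master_index.get(subject)
        if i is None:
            remaining[subject].append(triple)
        else:
            fragments[i].append(triple)
            object_counts[i][obj] += 1

    # Remaining triples: each subject (secondary master node) joins the
    # fragment in which it appears most often as an object. Assumption:
    # max() breaks ties in favour of the lowest fragment index.
    for subject, group in remaining.items():
        best = max(range(k), key=lambda i: object_counts[i][subject])
        fragments[best].extend(group)
        for _s, _p, obj in group:
            object_counts[best][obj] += 1  # assumption: counts are kept current
    return fragments
```

Each remaining subject is looked up in k hash maps, so the loop is bounded by O(n*k) as stated in the analysis of Algorithm 2.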


3.2 Distribution using Partial Replication 


When the user submits a query to the coordinator node, it will be 
answered using graph traversal from all the available nodes in the 
distributed environment in LBSD. This section includes details of 
the replication and distribution strategy. 

After fragmentation of the dataset, the fragments are not
necessarily almost equal in size, primarily because the frequency
of outgoing edges is not uniformly distributed over the triples.
While some nodes might have a high number of outgoing edges, others
might have very few. This might lead to a skewed distribution,
which results in unequal load distribution and delayed query
execution. To mitigate this problem, we calculate the sizes of the
fragments and allocate them to different sites in such a way that
there is an approximately equal load at each of the sites. So, a
fragment of bigger size should be placed with a fragment of smaller
size.


Algorithm 3: Distribution of Clusters
Input: fragments
Output: clusters
Function Allocation(fragments):
    Array A of (fragment, size of fragment)
    for every fragment F do
        A.insert(F, sizeof(F))
    sort A in descending order of size
    for every index i in A do
        add A[i].fragment to the cluster with the lowest size
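Algorithm 3 is a greedy largest-first allocation, which can be sketched as below. The function name `allocate`, the dictionary input and the use of a min-heap to find the least loaded site are illustrative choices, not details given in the text.

```python
import heapq


def allocate(fragment_sizes, n_sites):
    """Assign fragments to sites, largest first, always to the least loaded site.

    fragment_sizes: dict mapping a fragment id to its size (e.g. triple count).
    Returns a list with one list of fragment ids per site.
    """
    # Min-heap of (current load, site index); the root is the least loaded site.
    heap = [(0, site) for site in range(n_sites)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_sites)]

    by_size = sorted(fragment_sizes.items(), key=lambda item: item[1], reverse=True)
    for fragment, size in by_size:
        load, site = heapq.heappop(heap)
        assignment[site].append(fragment)
        heapq.heappush(heap, (load + size, site))
    return assignment
```

With fragment sizes 50, 30, 20 and 10 on two sites, this places the largest and the smallest fragment together, which is the pairing of big and small fragments described above.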


For replication, LBSD uses a centrality measure. Degree
centrality measures the number of incoming and outgoing
relationships of a node. The Degree Centrality algorithm can help
us find popular patterns (subject-object) in a graph. It is a
built-in feature of Neo4j [16]. For each predicate, we can measure
its centrality, i.e. how strongly it is connected to subjects.
Predicates with the same centrality should lie on the same host in
a distributed environment. In our experiment, centrality lies in
the interval [0.14, 1]. The highest centrality, 1, is reached when
the number of subjects and the number of labeled edges are the
same. Centrality helps to replicate and distribute data among the
available nodes.
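As a rough illustration of degree centrality on triples, the sketch below counts incoming and outgoing edges per vertex. Scaling by the maximum degree, so that scores fall in (0, 1], is an assumption made here for illustration; it is not claimed to be how Neo4j or LBSD normalizes the measure, and the per-predicate centrality used by LBSD is not reproduced.

```python
from collections import Counter


def degree_centrality(triples):
    """Return {vertex: degree / max degree} over in- plus out-degree."""
    degree = Counter()
    for subject, _predicate, obj in triples:
        degree[subject] += 1  # outgoing relationship
        degree[obj] += 1      # incoming relationship
    if not degree:
        return {}
    top = max(degree.values())
    # Assumption: normalize by the highest degree so the top vertex scores 1.
    return {vertex: d / top for vertex, d in degree.items()}
```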


Algorithm 4: Replication
Input: Triples
Output: Replicated Data
for each cluster c = 1 to n do
    for each predicate p do
        // Max[Ap] = centrality of max. occurring predicate
        if cen(p) > Max[Ap] then
            for i = 1 to n do
                P_i = p
                Data will be replicated on node for cluster
                c[i = 1 to n]


Replication copies data to the available nodes in the distributed
system. Partial replication replicates only the small amount of
data that satisfies a given threshold value or cut-off. Here we
have frequent patterns and their centrality. The top k subjects
obtained from Algorithm 1 yield the top k patterns, that is, the
properties associated with those subjects. These top k patterns
help to decide the replication level: the centrality of the top
pattern becomes the threshold value for partial replication, known
as Max[Ap]. For example, if some subject k1 in the list of k
subjects has centrality 0.58, then patterns with centrality between
0.58 and 1 will be replicated. In LBSD the threshold is 0.65, i.e.
patterns with centrality between 0.65 and 1 were counted as the
most frequent ones and replicated.
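The threshold rule can be sketched as follows, assuming predicate centralities have already been computed and are passed in as a dictionary. The function names are hypothetical, and the comparison is inclusive (>=) to match the "between 0.65 and 1" wording, whereas the pseudocode of Algorithm 4 uses a strict comparison.

```python
def select_for_replication(triples, centrality, threshold=0.65):
    """Return the triples whose predicate centrality reaches the threshold."""
    # Assumption: predicates with no recorded centrality are never replicated.
    return [t for t in triples if centrality.get(t[1], 0.0) >= threshold]


def replicate(clusters, centrality, threshold=0.65):
    """Copy every selected triple to each cluster that does not hold it yet."""
    all_triples = [t for cluster in clusters for t in cluster]
    hot = select_for_replication(all_triples, centrality, threshold)
    replicated = []
    for cluster in clusters:
        present = set(cluster)
        replicated.append(list(cluster) + [t for t in hot if t not in present])
    return replicated
```

Lowering the threshold enlarges the selected set, consistent with the larger replicated share reported for threshold 0.51 than for 0.65.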


4 EXPERIMENTAL DETAILS


The hardware setup consists of an Intel® Core™ i3-2100 CPU @
3.10 GHz with 8 GB of memory. The software setup consists of Neo4j
Desktop 1.1.10 [16], with Neo4j Browser version 3.2.19 used for
visualization. We have used NeoSemantics [17] to upload RDF data
files. As a distributed database we have used DGraph v1.0.13 [6].


4.1 Benchmark Dataset and Queryset 


The Linked Observation Data (LOD) [15] benchmark dataset is used
for the experiments. LOD has about 3000 categories. From these, we
have used Linked Sensor Data (LSD), which comprises sensor
observation results for hurricanes in the US. The dataset
includes different observations of wind direction, wind gust, air
temperature, humidity, precipitation, etc.

The benchmark LSD query set is used for the experiments. It
consists of 12 queries which are classified into four types. Type 1
and Type 2 queries are linear and star queries respectively. Type
3 and Type 4 queries are administrative (or range) queries and
snowflake queries respectively. There are 3, 4, 3 and 2 queries of
Type 1, 2, 3 and 4 respectively. Linear queries select some
predicates from the data, and star queries select specific
subjects/objects relevant to a given predicate. Administrative or
range queries retrieve data using an aggregation function or range
function, and snowflake queries are a combination of Type 1 and
Type 2.


4.2 Evaluation Parameters 


Performance of LBSD will be evaluated using the following quanti- 
tative and qualitative evaluation parameters: 


4.2.1 Quantitative parameters. This section discusses quantitative 
parameters that measure the performance of LBSD in terms of some 
percentage or value. 

Algorithm Execution Time (AET) is the time taken to execute all
four algorithms of LBSD.

Inter-Node Communication (INC) is measured as the communication
cost incurred to answer a query using different nodes.

Query Execution Time (QET) is the time taken by a query to complete
execution.

Query Join (QJ) measures the number of join operations needed to
execute a query.


4.2.2 Qualitative parameters. This section discusses qualitative
parameters, which compare LBSD in terms of quality measures.

Partitioning technique defines the technique used for the
partitioning of data.

Distribution technique defines the technique used for the
distribution of the RDF graph.

Workload information states whether any query workload
information is required for the execution.

Replication strategy defines the technique used to replicate
partitioned data.

Scalability defines how the system reacts when the data size
increases.

Storage requirement gives an idea of the amount of storage
space used by the system.


5 RESULTS AND DISCUSSION 


LBSD is demonstrated using the LSD benchmark data and query set.
This section presents results for basic and scaled query execution
time and for algorithm execution time. It also discusses the choice
of replication level. The results for other quantitative
parameters such as query joins and INC are also included here.

For all the experiments whose results are presented in this
section:


• Results are averaged over three consecutive executions to
reduce fluctuations for each query.

• Data size is increased from 20K to 100K with a step size of
20K.


Figure 2: Average QET for all types


5.1 Basic Query Execution Time (QET) 


QET analysis for LBSD has been done on LSD. There are four
types of queries, and results are obtained by analyzing the
performance for each of them. QETs are averaged over three
consecutive executions to reduce fluctuations for each query. These
QETs are then averaged over all the queries of each type.

Figure 2 shows that Type 2 queries take less time because they
just fetch values, whereas Type 4 queries take the most time of all
query types because they are snowflake queries, which are complex
compared to the others.


5.1.1 Data Scaling for QET. The data scaling experiment was done
for sizes from 20k to 100k (1x to 5x). Figure 3 shows that QET
increases with data size from 20k to 100k for all the query types.
This increase is more pronounced for Type 2 and Type 3 queries.


Figure 3: QET Before Partitioning (QET by dataset size for each
query type)


Type 2 queries take a large amount of time when the data size
increases from 40k to 60k, as the values to be fetched are
distributed over nodes. For all types of queries and across data
sizes, QET is reduced after distribution, as shown in Figure 4. On
average there is a 71% QET gain over all types of queries.
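The text does not spell out how the QET gain is computed. One common reading, assumed here for illustration only, is the relative reduction in QET after distribution:

```latex
\mathrm{gain} = \frac{\mathrm{QET}_{\mathrm{before}} - \mathrm{QET}_{\mathrm{after}}}{\mathrm{QET}_{\mathrm{before}}} \times 100\%
```

Under this reading, a 71% gain would mean that the average QET after distribution is about 29% of the QET before partitioning.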
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Figure 4: QET After Partitioning 


5.2 Algorithm Execution Time (AET) 


There are four algorithms used by the LBSD system. Algorithm
1 and Algorithm 2 are used in the first phase, and the second phase
of LBSD uses Algorithm 3 and Algorithm 4. The total time taken by
the system to execute all four algorithms for different data sizes
is shown in Figure 5. We can see that AET increases as data size
increases. There is a steep rise in the graph when the data size
increases from 80k to 100k (4x to 5x).
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Figure 5: Algorithm Execution Time 


5.3 Replication level


For partial replication, to decide the replication level we first
kept the threshold at centrality 0.65. As shown in Figure 6, there
is a linear increase in the number of triples to be replicated with
increasing data size. On average 12% of the data was replicated.
When we changed the threshold value to 0.51, the number of
replicated triples increased to an average of 14% of the data. For
this experiment, centrality 0.65 has been found to be optimal.
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Figure 6: Replication level for centrality 0.65 and 0.51 


5.4 Query Joins 

Compared to DWAHP [19] or any such relational system, LBSD works
better in terms of query joins (QJ). In a graph database, related
data is reached by traversing edges, so no join operations are
needed. This is an advantage of LBSD: it eliminates QJ, which
accelerates queries over distributed data.


5.5 Inter Node Communication 


Inter-Node Communication (INC) is the amount of communication
required between the available nodes in a distributed environment.


Figure 7: Inter Node Communication (queries sent to three nodes:
Node1 Alpha_1 on port 8080, Node2 Alpha_2 on port 8081, Node3
Alpha_3 on port 8082)


As shown in Figure 7, INC is defined by the communication cost
that a query incurs when it is answered using the three available
nodes, named Alpha_1, Alpha_2 and Alpha_3. These three nodes run
on different ports of DGraph Ratel. Out of the 12 queries, on
average 7 are answered by the local node. So approximately 58% of
queries are answered without INC.


5.6 Comparison of LBSD with the state of the art


LBSD is compared with different techniques, including APR [2],
which is a graph-based state-of-the-art work, and DWAHP [19], a
similar relational state-of-the-art work. This section describes
the detailed comparison of the qualitative and quantitative
parameters listed in Section 4.2.

Table 1 shows the qualitative comparison of LBSD with APR and
DWAHP. LBSD is a semantic aware partitioning technique, whereas APR
uses METIS [14], a simple weighted partitioning technique. Semantic
aware partitioning keeps the semantic relation between nodes alive
even after partitioning. Due to the semantic relation, similar
nodes will be on the same host, which accelerates linear and star
queries. DWAHP is a relational approach which uses a hybrid
partitioning technique combining property and binary tables. For
distribution, LBSD uses a centrality-based replication threshold,
whereas APR uses a global query graph approach to identify the
communication cost of border nodes. This helps to reduce the
communication cost between nodes on different hosts through
replication.

The quantitative results for LBSD are shown in Table 2. The
average QET gain over all types of queries is 71%. The AET is
averaged over all four algorithms and the total time complexity is
O(n). The INC result is 58%, i.e. 58% of queries are answered
without INC. LBSD eliminates complex query join operations.

LBSD is a static partitioning approach, while APR is an adaptive
approach to distribution and partitioning. LBSD does not require
workload information, whereas APR and DWAHP both require it. For
ad hoc queries, a workload-aware system needs to re-run the
algorithm with updated workload information. APR is implemented in
such a way that it also takes space into consideration while
distributing data. APR works well with storage adaptation at three
levels. It also uses a compressed replication technique in which
long URIs are compressed to numerical values. DWAHP and LBSD use a
partial replication technique. APR and DWAHP exhibit better
scalability than LBSD.

QET for DWAHP and LBSD is almost the same. LBSD reports faster 
AET than DWAHP, but it shows poor scalability beyond 4x. LBSD and 
APR, being graph-based techniques, are able to eliminate query 
joins. INC is approximately the same for LBSD and DWAHP. APR 
generates a small number of big clusters compared to LBSD, which 
reduces INC for APR. As a result, LBSD and DWAHP queries need to 
scan less data. APR replicates almost 30% of the data, whereas LBSD 
needs to replicate only 12% of the data. As APR needs to re-run its 
algorithm for an updated workload, LBSD reports faster AET than 
APR. APR also incurs an indexing overhead, as it uses the RDF-3X 
engine, which requires indexes over all three columns: Subject, 
Object, and Predicate. 


6 CONCLUSION 


This method was implemented to manage the increasing size of RDF 
data through semantic-aware partitioning and distribution of data 
using a graph approach. LBSD partitions the data based on the 
in-degree and out-degree of vertices. For distribution, we spread 
the data over the three available virtual nodes. LBSD was compared 
in terms of two types of parameters: qualitative and quantitative. 
To analyze performance in terms of QET, the system uses 4 types of 
queries. It shows an average gain of 71% for all types of queries 
after distribution. In the scalability experiments, the QET gain for 
type 2 queries increases linearly, with an average gain of 72%, as 
these queries have lower INC, whereas type 4 has an average gain of 
55% as the data size increases from 20k to 100k. The system also 
performs better in terms of inter-node communication, as 58% of 
queries are answered by the local node. The scalability results show 
that AET increases rapidly when the data size increases from 80k to 
100k. Analysis of this rapid increase in AET is left for future 

Table 1: Qualitative comparison of LBSD with the state of the art 

Parameters | LBSD | DWAHP [19] | APR [2] 
Partitioning technique | Semantic aware | Hybrid | Simple 
Distribution technique | Load balanced, using replication threshold | Not load balanced, using n-hop reachability matrix | Load balanced, using Global Query Graph 
Workload information | Not required | Required | Required 
Replication technique | Partial replication | Partial replication | Compressed replication 
Scalability | Poor beyond 4x | Better | Better 
Storage requirement | Basic data + 12% replicated data | Basic data + around 20% replicated data | Basic data + 30% replicated data, RDF-3X indexing 

Table 2: LBSD Results 

Parameters | LBSD 
Average QET gain | 71% 
AET (time complexity) | O(n) 
Queries answered without INC | 58% 
Query joins | Eliminated 


work. The system could also be made adaptive to deal with dynamic 
data, which is likewise left as future work. 


ABSTRACT 


In this vision paper, we introduce an idea of a framework that would 
enable us to model, represent, and manage multi-model data in a 
unified and abstract way. Its core idea exploits constructs provided 
by category theory, which is sufficiently general but still simple 
enough to cover any of the logical data models used in contem- 
porary databases. Focusing on promising features and taking into 
account mature and verified principles, we overview the key parts 
of the framework and outline open questions and research direc- 
tions that need to be further investigated. The ultimate objective is 
to pursue the idea of a self-tuning system that would permit us to 
collapse the traditionally understood conceptual and logical layers 
into just a single model allowing for unified handling of schemas, 
data instances, as well as queries. 


CCS CONCEPTS 


- Information systems → Entity relationship models; Semi-structured 
data; Data structures; Query languages. 


1 INTRODUCTION 


The variety feature of Big Data inciting the so-called multi-model 
data has opened a challenging direction of data management. The 
(primarily) academia-driven approach, represented mainly by poly- 
stores [9], is based on the idea of polyglot persistence, i.e., the usage of 
a mediator managing a set of underlying database management sys- 
tems (DBMSs), each being the best suitable candidate for a particular 
data model. On the other hand, there are (industry-driven) multi- 
model DBMSs [14] that offer the support of multiple models under 
the hood of a single system, treating all the data models as first-class 
citizens [15]. Both approaches have to face the same challenges 
brought by contradictory features of different but interlinked mod- 
els and their data management specifics. For example, there are 
structured/semi-structured/unstructured formats, systems based 
on strong/eventual consistency, schema-full/schema-less/schema- 
mixed systems, declarative/functional query languages, etc. 

In this vision paper, we describe the necessary steps that need 
to be carried out in order to provide a full-fledged and universally 
applicable solution to the problem of multi-model data manage- 
ment. We argue that a highly promising approach could be based 
on category theory [3], a theory general enough to cover all the 
currently popular data models (and probably even more) and hav- 
ing a sound mathematical background important for the efficient 
and correct data processing. Exploiting the constructs it offers, we 
provide a vision of a complex framework that would allow us to 
merge the conceptual and logical layers into just a single model 
through which schema descriptions, data instances, as well as query 
expressions could be handled in an entirely uniform and abstract 
way. 

Based on the inspirational related work and with the help of a 
set of illustrating examples, the main contributions of the paper 
are: (1) discussion of ways how category theory can bring answers 
and solutions to the aspects of schema modeling, data representa- 
tion, querying as well as evolution, transformations or migration, 
(2) identification of the related open questions and research direc- 
tions that need to be further investigated so that the framework 
can be applied in polystore and multi-model scenarios, and (3) the 
actual description of the envisioned category-based unified au- 
tonomous DBMS, including its main advantages with respect to the 
contemporary systems. 

Section 2 provides a brief introduction to category theory, Sec- 
tion 3 then thoroughly introduces the key parts of the framework, so 
that in Section 4, we summarize advantages of the category-based 
unified DBMS and conclude. 


2 CATEGORIES 


Formally, a category $C = (O, M, \circ)$ consists of a set of objects $O$ 
serving as graph vertices, a set of morphisms $M$ acting as directed 
edges, and a composition operation $\circ$ for the morphisms. 

Each morphism is modeled and depicted as an arrow $f : A \to B$, 
where $A, B \in O$, $A$ being referred to as a domain and $B$ as a 
codomain, respectively. Whenever $f, g \in M$ are two morphisms 
$f : A \to B$ and $g : B \to C$, it must hold that $g \circ f \in M$, i.e., 
morphisms can be composed using the $\circ$ operation and the composite 
$g \circ f$ must also be a morphism of the category (i.e., transitivity 
is required). Moreover, $\circ$ must be associative, i.e., 
$h \circ (g \circ f) = (h \circ g) \circ f$ for any morphisms $f, g, h \in M$, 
$f : A \to B$, $g : B \to C$, and $h : C \to D$. 
Finally, for every object $A$, there must exist an identity morphism 

[Figure 1 diagram, first part: graph Friends with nodes Mary, Anne, and John linked by Friend edges; table Customer; table Credit; column family Orders keyed by CustomerId, with arrays of order identifiers such as [220, 230, ...] and [10, 217, ...].] 


Martin Svoboda, Pavel Čontoš, and Irena Holubová 


collection Order 

{ OrderId : 220, Items : [{ ProductId : B1, Name : Fairy Tales, Quantity : 1 }] } 
{ OrderId : 217, Items : [{ ProductId : T1, Name : Toy Car, Quantity : 2 }] } 

collection Product 

{ ProductId : B1, Kind : Book, Name : Fairy Tales, Price : 20 } 
{ ProductId : T1, Kind : Toy, Name : Toy Car, Price : 35 } 

[Figure 1 diagram, remainder: a further Orders array [94, 137, 214] of the column family.] 


Figure 1: Sample multi-model scenario 


$1_A$ such that $f \circ 1_A = f = 1_B \circ f$ for any $f : A \to B$ (obviously 
serving as a unit with respect to the composition). 

As an example, let us mention at least the Set category, where 
objects are arbitrary sets and morphisms functions between them. 
Note, however, that both objects and morphisms can, in general, 
represent abstract entities of any kind. 
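
To make the definition concrete, the following minimal sketch checks the associativity and identity laws on a tiny fragment of Set. It is only an illustration: the finite sets, the functions f, g, h, and the helper names compose and identity are assumptions introduced here, not part of the framework.

```python
# Morphisms of a small fragment of Set: total functions between finite sets,
# stored as (domain, codomain, mapping). All names are illustrative only.
A = frozenset({1, 2})
B = frozenset({"x", "y"})
C = frozenset({"p", "q"})
D = frozenset({"z"})

f = (A, B, {1: "x", 2: "y"})        # f : A -> B
g = (B, C, {"x": "p", "y": "q"})    # g : B -> C
h = (C, D, {"p": "z", "q": "z"})    # h : C -> D


def compose(second, first):
    """Return second . first; defined only if cod(first) == dom(second)."""
    dom1, cod1, map1 = first
    dom2, cod2, map2 = second
    if cod1 != dom2:
        raise ValueError("morphisms are not composable")
    return (dom1, cod2, {a: map2[map1[a]] for a in dom1})


def identity(obj):
    """Identity morphism on the given object."""
    return (obj, obj, {a: a for a in obj})


# Associativity: h . (g . f) == (h . g) . f
assert compose(h, compose(g, f)) == compose(compose(h, g), f)
# Unit laws: f . 1_A == f == 1_B . f
assert compose(f, identity(A)) == f == compose(identity(B), f)
```

Composition is deliberately partial here: composing two functions whose codomain and domain do not match raises an error, mirroring the requirement that only morphisms of the form f : A → B and g : B → C can be composed.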

Considering the related work, category theory is a promising 
solution for the indicated issues. Existing bottom-up approaches start 
from a single logical model (relational [20, 21], object-relational [24], 
or CSV/document/RDF [23]) and define a respective schema category 
and operations using standard categorical approaches (such 
as functors). A top-down approach from [12] defines a schema category 
covering various conceptual modeling approaches, but only 
with respect to the most common model of that time, the relational 
model. In this paper, we also follow this direction, however, 
with respect to multiple interlinked data models at a time. In other 
words, we will show that the categorical approach can also be used 
for the multi-model world. 


3 FRAMEWORK 


The objective of the following text is to describe features, require- 
ments, and principles of the core components of the envisioned 
database framework that would allow us to formally grasp the ex- 
isting scenarios of polystores and multi-model databases, as well as 
could potentially contribute to the proposal of a truly unified and 
conceptual database system. 


3.1 Multi-Model Scenario 


In order to demonstrate the intended functionality, let us assume a 
sample multi-model scenario we will use throughout the individual 
illustrative examples. 


EXAMPLE 1. Figure 1 provides an example of a multi-model scenario. 
The relational model (violet) contains general information about customers, 
whereas the graph model (blue) captures their mutual friendship. The document 
model (green) maintains orders which are bound with particular customers 
using the wide-column model (red). □ 


The range of the available and widely used logical models is, of 
course, wider than the particular models we incorporated into our 
sample scenario. While the traditional relational model was and 
still is used as the primary option, a recently emerged family of 
NoSQL systems enabled wider usage of key/value, wide column, 
document, and graph models, too. As we have mentioned, their 
contradictory features, unfortunately, increase the complexity of 
handling the multi-model data in a truly universal way. 

For the purpose of describing schemas of data at the conceptual 
layer (which intentionally abstracts and conceals specific details of 
the underlying logical models), existing mature and frequently used 
modeling languages such as ER [5, 17] and UML [19] (class diagrams 
in particular) can be exploited (see Figure 2 for the sample scenario). 
Unfortunately, while ER is more expressive and suitable for describ- 
ing complex real-world relationships, it is not well-formalized and 
exists in various notations differing not just visually. On the other 
hand, UML is standardized, but only too data-oriented (lacking, e.g., 
weak entity types or other constructs). 


[Figure 2 diagram: ER schema around the entity type Customer with the labels Lastname, Firstname?, CustomerId, Credit, ProductId, Kind, Price, OrderId, and Quantity, and a cardinality label (0,*).] 


Figure 2: ER schema diagram for the sample data 


Although this probably should not be the case, the main disad- 
vantage of both the existing conceptual approaches is that they 
are actually not entirely suitable for grasping distinct data struc- 
tures assumed by different logical models, simply because they 
are actually very closely related and inspired by the traditional 
relational model. This means they are built on top of the notion of 
ordinary sets of tuples in the first normal form, and so clearly not 
fully conforming to the principles and nature of the other models 
that permit, e.g., repeated values, union types, or hierarchically 
constructed properties. 

There actually already exist papers dealing with ER modeling 
of multi-model data (e.g. [4, 10]), primarily for polystores. Unfor- 
tunately, they lack crucial details of mapping from the conceptual 
layer to the particular DBMSs, which strongly influences inter- 
model references, model overlapping, query optimization, evolution 
management, etc. 


3.2 Schema Modeling 


As a consequence of the previous observations, a new strategy 
capable of unified multi-model schema description and data repre- 
sentation needs to be found. We believe that an approach based on 
category theory could be promising enough, not just because of its 
universality and formal theoretical background, but also because of 
the already outlined existing approaches, though they are suffering 
from various drawbacks and are not yet ready for the multi-model 
scenarios. 

The following idea of a schema category¹ can form the basis 
for the categorical conceptual modeling via which we will be able 
to describe the intended structure of the data together with basic 
integrity constraints. Borrowing the terminology from ER, objects 
of this category will represent individual entity types, attributes, 
as well as relationship types (without necessarily needing to 
distinguish between them later on, in the optimal case). In order not to 
disallow entity types with several or composed identifiers, 
super-identifiers can be exploited to encompass all of them, if needed. 
Morphisms will then interconnect the corresponding objects, 
allowing us to model the traditional concept of relationship cardinalities 
and attribute multiplicities through (min, max) constraints, where 
min ∈ {0, 1} and max ∈ {1, *}. Obviously, min ∈ {0, 1} would 
restrain the lower bound of a number of occurrences (optional vs. 
compulsory) and max ∈ {1, *} the upper bound (at most one vs. 
arbitrarily many). 

In order to make our category visualizations easier, we follow 
the convention that identity and non-core morphisms belonging to 
the transitive closure over the composition operation are entirely 
omitted, and that morphisms are labeled only with cardinalities 
different from (1, 1), being treated as the default. 


[Figure 3 diagram: schema category with the objects Customer, CustomerId, Firstname, Lastname, Credit, Friend, Orders, Order {{OrderId}}, OrderId, Items, Quantity, Product {{ProductId}}, ProductId, Kind, Name, and Price; several morphisms are labeled (0,*).] 


Figure 3: Schema category for the sample data 


EXAMPLE 2. In Figure 3, we can see the schema category corresponding to 
the ER model from Figure 2. For instance, the object (node) Customer represents 
the entity type Customer. Neighboring objects connected via morphisms (edges) 
correspond to its attributes (Firstname, Lastname, etc.) as well as relationship 
types (Orders and Friend) with respective cardinalities. □ 
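
A schema category of this kind could be held in memory roughly as follows. This is a minimal sketch under stated assumptions: the class Morphism, its field names, the helper label, and the concrete cardinalities assigned to the Customer attributes are hypothetical and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morphism:
    """Directed edge of the schema category with a (min, max) constraint."""
    name: str
    dom: str
    cod: str
    min_card: int = 1    # lower bound: 0 (optional) or 1 (compulsory)
    max_card: str = "1"  # upper bound: "1" or "*"

    def __post_init__(self):
        if self.min_card not in (0, 1) or self.max_card not in ("1", "*"):
            raise ValueError("expected min in {0, 1} and max in {1, *}")


# Fragment around the Customer entity type (cardinalities are assumed).
objects = {"Customer", "CustomerId", "Firstname", "Lastname", "Credit"}
morphisms = [
    Morphism("id", "Customer", "CustomerId"),
    Morphism("firstname", "Customer", "Firstname", min_card=0),
    Morphism("lastname", "Customer", "Lastname"),
    Morphism("credit", "Customer", "Credit"),
]

# Every morphism must connect objects of the category.
assert all(m.dom in objects and m.cod in objects for m in morphisms)


def label(m):
    """Drawing label: the default cardinality (1, 1) is omitted."""
    if (m.min_card, m.max_card) == (1, "1"):
        return ""
    return f"({m.min_card},{m.max_card})"


for m in morphisms:
    print(m.dom, "->", m.cod, label(m))
```

The label helper follows the visualization convention stated above, where only cardinalities different from (1, 1) are shown.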


Even though schema categories could also be designed directly 
by database users, this strategy might not be considered conve- 
nient enough, not just because of technical aspects that need to be 
carefully followed and ensured. Thus the outlined concept should 
primarily be understood as a means for internal database schema 
representation rather than a modeling language as such. This, in 
other words, means that a corresponding user-friendly modeling 
language would need to be proposed, too. 

Moreover, in order to enable the option of deploying the envi- 
sioned framework in the context of the existing systems, transfor- 
mations of ER and UML schemas, as well as at least semi-automatic 
inference methods capable of deriving schema categories from sam- 
ple input data, would need to be proposed, too. More broadly, prin- 
ciples of generic schema management should be followed, similarly 
as in [1]. 


¹ For the sake of simplicity, we omit technical details and more complex constructs. 
A separate paper [22] is devoted to the proposal of the schema category and the 
transformation process. In addition, the category itself can be designed variously, 
which is one of the open problems of the idea. 


3.3 Database Decomposition 


The very idea behind polystore and multi-model database scenarios 
is that different parts of the data are logically modeled, physically 
stored, and further processed differently by means of the corre- 
sponding logical models and query languages they are accompanied 
with. Having the schema category, the next step is to decompose it 
into such parts, let us call them database components. In practical 
terms, each of these components is expected to correspond to one 
of the underlying systems involved in a polystore (each one of them 
is accessible and queryable, yet specifically), and logical models 
involved in a multi-model database (with data directly and natively 
retrievable), respectively. 

Each of these components would consist of a set of particular 
selected objects and morphisms, as if forming kind of a subgraph of 
the entire category that permits to incorporate the individual mor- 
phisms even without their ending objects (domain and codomain). 
Moreover, these components might be and most likely will often 
tend to intentionally be disconnected and/or more-or-less over- 
lapping in real-world use cases. While the former aspect would 
require to take into account also the derived morphisms during 
the decomposition process, i.e., non-core morphisms derived using 
the category composition operation, the latter one finds its applica- 
bility in deliberate data redundancy across logical models, and so 
potentially improving query evaluation efficiency for data that is 
expected to be accessed together. 


EXAMPLE 3. Sample schema category decomposition into particular data 
models (depicted using the colors from Figure 1) is visualized in Figure 4. For 
instance, the document (green) component covers orders. The graph (blue) 
and relational (violet) components partially overlap, since we maintain the 
respective attributes (Firstname and CustomerId) in both the models. □ 


[Figure 4 diagram: the schema category from Figure 3 with its objects and morphisms grouped and colored by database component (relational, graph, wide-column, document).] 


Figure 4: Decomposition into database components 
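
The overlap between components can be sketched by treating each component as a set of schema objects. The assignment below is an assumption made for illustration (only the shared Firstname and CustomerId of the graph and relational components are stated in Example 3), and the helper overlaps is hypothetical.

```python
from itertools import combinations

# Hypothetical assignment of schema objects to database components.
components = {
    "relational": {"Customer", "CustomerId", "Firstname", "Lastname", "Credit"},
    "graph": {"Customer", "CustomerId", "Firstname", "Friend"},
    "wide_column": {"CustomerId", "Orders", "OrderId"},
    "document": {"Order", "OrderId", "Items", "Quantity",
                 "Product", "ProductId", "Kind", "Name", "Price"},
}


def overlaps(comps):
    """Return {(name1, name2): shared objects} for every overlapping pair."""
    result = {}
    for (n1, o1), (n2, o2) in combinations(sorted(comps.items()), 2):
        shared = o1 & o2
        if shared:
            result[(n1, n2)] = shared
    return result


for pair, shared in overlaps(components).items():
    print(pair, sorted(shared))
```

Objects that appear in more than one component correspond to the deliberate redundancy discussed above; a real decomposition would also have to track morphisms, including the derived ones.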


Focusing on a particular component, its objects and morphisms 
will then need to be internally mapped to particular logical con- 
structs and structures provided by a given model, e.g., tables (to- 
gether with their columns, primary keys, foreign keys, or other 
constraints) in case of the relational model, or collections of JSON 
documents (and their objects, arrays, and values) in case of the 
document model assumed by MongoDB. 


EXAMPLE 4. Let us consider the document (green) component in Figure 4. 
There exist several ways to map this part of the category to one or more 
documents. Using a (semi-)automatic approach, the developer may choose, e.g., 
the following JSON schema (expressed symbolically): 

{ OrderId, Items: [ { ProductId, Name, Quantity }* ] } 

{ ProductId, Kind, Name, Price } □ 


3.4 Data Representation 


Data assigned to each of the outlined database components is stored 
and logically represented differently, thus we have to find a way 
how such diversity could be encompassed within just one kind of a 
unifying data structure that would be capable of handling all the 
specifics of the individual models, but would still be simple enough. 
It is also worth realizing that this structure should serve not just 
as a means of modeling the data actually present in the database 
(and so conforming to its conceptual schema), but also a means to 
represent results of the evaluated query expressions, including the 
intermediate ones. 

While the existing multi-model DBMSs [14] more or less painfully 
create an extension of the data structures used for the original sin- 
gle model, there exist proposals of more general approaches, too. 
E.g., the NoSQL Abstract Model (NoAM) [2] represents the data as 
named collections, each containing a set of blocks consisting of a 
non-empty set of entries. Associative arrays [8] are then defined as 
mappings from pairs of unique (column and row) keys to values. 
Or the Tensor Data Model (TDM) [11], which introduces the idea of 
generalized matrices. We believe, however, we can go significantly 
further. 

It would only be beneficial if the desired data structure could be 
derived from the outlined schema category. Such an objective could 
be achievable through an instance category. The idea is that this 
category would have exactly the same objects and morphisms when 
compared to the corresponding schema category, they would only 
differ in what the objects and morphisms mean and contain, i.e., 
what they are supposed to internally model. Perhaps tables, i.e., sets 
of tuples, could be utilized as a basis, at least for now. Hence, the 
idea is that each entity/relationship/attribute object could contain 
a set of values belonging to the active domain, and morphisms 
mappings of pairs of such values. 


EXAMPLE 5. Figure 5 provides a part of the instance category bearing the 
particular information about each product, i.e., ProductId and Name. □ 
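
The idea of objects holding sets of values and morphisms holding sets of value pairs can be sketched as follows, using the product data from Figure 1. The dictionary layout and the helper satisfies are assumptions made for illustration, not the proposed data structure.

```python
# Objects hold sets of values from the active domain.
instance_objects = {
    "Product": {"B1", "T1"},  # products identified by their ProductId
    "Name": {"Fairy Tales", "Toy Car"},
}

# Morphisms hold sets of (domain value, codomain value) pairs.
instance_morphisms = {
    ("Product", "Name"): {("B1", "Fairy Tales"), ("T1", "Toy Car")},
}


def satisfies(pairs, domain_values, min_card, max_card):
    """Check a (min, max) constraint; max_card=None means unbounded (*)."""
    for value in domain_values:
        count = sum(1 for a, _ in pairs if a == value)
        if count < min_card:
            return False
        if max_card is not None and count > max_card:
            return False
    return True


pairs = instance_morphisms[("Product", "Name")]
print(satisfies(pairs, instance_objects["Product"], 1, 1))
```

Because plain sets are used, ordering and duplicates cannot be represented, which is exactly the limitation of the relational basis discussed in the text.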


[Figure 5 diagram: object Product {{ProductId}} connected by a morphism to object Name.] 


Figure 5: Part of an instance category 


Obviously, it may not have been a good direction to stick with 
the principles of the relational model since complications at least 
with the ordering of values, duplicates, or non-trivial cardinalities 
of sub-attributes will unavoidably appear. Nonetheless, having the 
instance category proposed, a particular data format and techniques 
for data transformations will then also be needed (analogously 
to [13, 20]) so that a unified instance category can be obtained for 
input data in a particular format/model as well as such a category 
serialized in the opposite direction. Such low-level transformations 
will then find their essential position in data migration, evolution, 
or database self-tuning processes. 


3.5 Query Language 


The core functionality of each database system lies in its capability 
to query the data stored within it [26]. For obvious reasons, the 
expressive power of the provided query language must be suffi- 
cient with respect to the intended purpose and usage [6], though 
it may significantly differ across the models, systems, and partic- 
ular languages, too. Similarly, evaluation of query expressions as 
such must be efficient enough. The major limiting disadvantage 
of the existing settings is that users must be thoroughly aware of 
the logical schema of the data. It means that the languages and so 
the query expressions are, not surprisingly, tightly bound with the 
structure of the data to be queried. While this can be acceptable in 
single-model systems, it no longer is in truly multi-model ones. 

Therefore we advocate for finding ways of querying that would 
genuinely be conceptual, though expressive enough from the prac- 
tical point of view. In other words, the goal is to find a unified 
conceptual query language that would permit to query the data 
without any further knowledge of the database decomposition (pos- 
sibly even changing through time) and regardless of the specifics of 
the individual involved logical models. This means one query 
language, one set of query constructs, one syntax. Moreover, it should also 
be closed with respect to the data model, i.e., that both extensional 
data as well as intermediate and final query results are modeled 
uniformly, via instance categories. 

The envisioned querying could be based on sub-graph pattern 
matching widely exploited in graph databases, simply because 
graph models are broadly understood as the most complicated ones, 
as well as because categories in general are tightly related to graphs. 
The idea is to describe one graph pattern we are looking for in terms 
of a pattern category, consisting of only the selected objects and 
morphisms available in the corresponding schema category. The in- 
dividual objects could be associated with selection conditions acting 
as domain filters. These conditions may be however complicated 
(with ranges, enumerations, conjunctions, disjunctions, ...), but 
may only involve a given object. Similarly, even morphisms could 
be equipped with conditions (comparisons, ...), either filtering or 
joining, always referencing only the pair of involved objects. 


EXAMPLE 6. Figure 6 introduces a pattern category for a query that should 
return names, kinds, and prices of all books or toys which can be bought by 
a customer with name Mary. As we can see, the black parts correspond to 
the schema category objects and morphisms, and so the database contents, 
whereas the red part is a newly introduced morphism enabling to join the two 
fragments using a joining condition. Grey-filled circles represent projection 
(objects that should appear in the final result). □ 


[Figure 6 diagram: pattern objects Firstname, Customer, Credit, Price, Product, Kind, and Name; conditions Firstname = "Mary", credit > price, and kind = "Book" ∨ kind = "Toy".] 


Figure 6: Pattern category for a sample query 


Morphisms that were taken over from the schema category ba- 
sically act as the corresponding inner joins, extensionally defined 
directly by the contents of the database. If it happens there are 
two or more disconnected parts (only assuming the adopted mor- 
phisms), they are joined using the Cartesian product (cross join), or 
a theta-join in case there are newly added morphisms with further 
joining conditions. It is also important to realize that particular 
schema objects and morphisms can be reused repeatedly in a query 
pattern. All in all, everything described inside one pattern category 
is expected to be satisfied as if in conjunction, i.e., based on the 
all-or-nothing principle. 
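
Whether a pattern needs a cross join or a theta-join can be decided by grouping its objects into fragments connected by the adopted morphisms. The sketch below does this for the pattern of Example 6; the function fragments and the list-of-pairs encoding of morphisms are assumptions made for illustration.

```python
def fragments(objects, adopted):
    """Group pattern objects into parts connected by adopted morphisms."""
    parent = {o: o for o in objects}

    def find(o):
        # Follow parent links up to the representative of the group.
        while parent[o] != o:
            o = parent[o]
        return o

    for a, b in adopted:
        parent[find(a)] = find(b)

    groups = {}
    for o in objects:
        groups.setdefault(find(o), set()).add(o)
    return list(groups.values())


pattern_objects = ["Firstname", "Customer", "Credit",
                   "Price", "Product", "Kind", "Name"]
# Morphisms taken over from the schema category (act as inner joins).
adopted = [("Customer", "Firstname"), ("Customer", "Credit"),
           ("Product", "Price"), ("Product", "Kind"), ("Product", "Name")]
# Newly introduced morphism carrying the joining condition credit > price.
added = [("Credit", "Price")]

parts = fragments(pattern_objects, adopted)
join_kind = "theta-join" if added else "cross join"
print(len(parts), "fragments, combined by", join_kind)
```

For this pattern the customer-related and product-related objects form two fragments, and the added morphism turns what would otherwise be a Cartesian product into a theta-join.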

The notion of a pattern category obviously cannot be used in 
order to describe more complicated queries. Therefore the next 
natural step is to allow ourselves to construct a query category 
through which we would be able to combine several simple pattern 
categories by advanced query constructs and operations modeled as 
special query objects. That would hopefully permit incorporating 
not just the following widely considered constructs: set operations 
over the patterns (union, intersection, difference), existential and 
universal quantifiers including their negations, or optional patterns. 
Though there will undoubtedly be a variety of not just technical 
obstacles, we also need to be able to cover disjunctions of patterns 
through which complicated disjunctive conditions involving several 
objects could be expressed. Similarly, we need to cover non-binary 
filtering or joining conditions, grouping and aggregation, or 
derivation of new values not present in the database via arithmetic 
expressions, function calls, etc. 

Syntax of the actual query language as such will also need to be 
carefully designed so that it permits to express all the ideas but still 
remains user-friendly enough. While for the purpose of describing 
the individual pattern categories the idea of ASCII-art-inspired 
syntax assumed by Cypher query language in Neo4j could most 
likely be exploited, principles of the composition of more complex 
queries are a more significant challenge, possibly benefiting from the 
ideas of sub-queries or chaining of clauses, or simply the existing 
proposals such as SQL++ [16]. 


3.6 Query Evaluation 


Based on the knowledge of the schema category, database decom- 
position, and internal schema mappings, the query category now 
needs to be decomposed into individual query parts, each to be eval- 
uated separately so that it produces the corresponding intermediate 
result modeled in terms of instance categories with their schemas. 
Each query part consists of the selected objects and morphisms 
from the query category, yet not all of them do necessarily need to 
be involved. In fact, in the case of cross-model queries, evaluation 
of certain morphisms and/or query objects will intentionally need 
to be postponed, simply because no database component is capable 
of their evaluation on its own. 

While in a polystore scenario each query part needs to be trans- 
lated into the corresponding query expression within the specific 
query language a given system supports, and this expression inter- 
nally evaluated so that the yielded result can be transformed into 
our unified logical representation, the corresponding intermediate 
result can directly be obtained from the database data files (at the 
physical layer) in case of a multi-model scenario. 


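[Figure 7 diagram: the pattern from Figure 6 (objects Firstname, Customer, Credit, Price, Product, Kind, and Name; conditions Firstname = "Mary", credit > price, and kind = "Book" ∨ kind = "Toy") split into a relational part and a document part.] 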


Figure 7: Query category decomposition 


EXAMPLE 7. Our decomposed inter-model query is depicted in Figure 7. 
The relational part (violet) extracts credits of customers named Mary: 
  SELECT T2.Credit 
  FROM Customer AS T1 NATURAL JOIN Credit AS T2 
  WHERE T1.Firstname = 'Mary'; 
Analogously, using the MongoDB query language, the document part (green) 
extracts all properties but ProductId of books or toys: 
  db.Product.find( 
    { Kind: { $in: [ "Book", "Toy" ] } }, { ProductId: 0 } 
  ); 
□ 
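
The two query parts can be tried out with a small, self-contained sketch. The relational part is run against an in-memory SQLite database; the table layouts, the customer identifiers, and the credits of Anne and John are assumptions (only Mary's credit of 30 appears in Figure 8). The document part is emulated with plain Python rather than executed against MongoDB.

```python
import sqlite3

# Relational part: hypothetical layouts of the Customer and Credit tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Firstname TEXT);
    CREATE TABLE Credit (CustomerId INTEGER, Credit INTEGER);
    INSERT INTO Customer VALUES (1, 'Mary'), (2, 'Anne'), (3, 'John');
    INSERT INTO Credit VALUES (1, 30), (2, 50), (3, 10);
""")
credits = [row[0] for row in conn.execute("""
    SELECT T2.Credit
    FROM Customer AS T1 NATURAL JOIN Credit AS T2
    WHERE T1.Firstname = 'Mary'
""")]
conn.close()

# Document part: the Product collection from Figure 1.
products = [
    {"ProductId": "B1", "Kind": "Book", "Name": "Fairy Tales", "Price": 20},
    {"ProductId": "T1", "Kind": "Toy", "Name": "Toy Car", "Price": 35},
]
# Plain-Python stand-in for the find() call: filter on Kind and
# project away ProductId.
found = [
    {key: value for key, value in doc.items() if key != "ProductId"}
    for doc in products
    if doc["Kind"] in ("Book", "Toy")
]

print(credits)
print(found)
```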


Note that multiple query parts belonging to the same database 
component may be generated, because the expressive power of a 
given query language may not be sufficiently high. For example, 
we can expect separate query parts and so multiple find queries for 
each of the involved collections in the case of MongoDB. Similar 
issues can appear in key/value or wide column databases, too. 

Last but not least, while the query translation process takes a 
query category part and rewrites it to a specific query expression, 
we also need to deal with the opposite direction so that existing 
query expressions, at least those in widely used languages, can be 
taken and migrated to our categorical representation. 

Having a given query part evaluated, the obtained intermediate 
result needs to be represented as an instance category with a struc- 
ture conforming to a newly defined schema category. For example, 
having an intermediate result in a form of a table retrieved from a 
relational system or a collection of possibly projected documents 
from MongoDB, its corresponding schema category could be con- 
structed by contracting the given query category part (its objects 
and morphisms) into a star with all the involved attribute objects 
(simple or with sub-attributes) collected around the central object 
representing the super-identifier of the entire obtained result. The 
central object is derived from the backbone objects and morphisms 
of the query part, i.e., as a result of more-or-less complicated joining 
process. 

The schema contraction step is essential in order to support 
advanced query constructs such as groupings, aggregations, or 
derivation of new values in general. As a consequence, however, 
we may lose the interpretability of certain objects or morphisms 
with respect to the original schema. 

Once all the recognized query parts are evaluated, and the 
intermediate results thus obtained, the evaluation of all the postponed 
and not yet considered morphisms and/or query objects can be 
completed at the unified layer of the framework. This means that 
we need to evaluate these morphisms/objects one by one, partially 
contracting the corresponding parts of the current query category 
step by step, so that we eventually retrieve the result of the entire 
query at the very end. 


EXAMPLE 8. As depicted in Figure 8, the intermediate results for the 
relational (violet) and document (green) parts of the query were transformed 
into instance categories with the respective contracted schemas. By evaluating 
the postponed theta join, we retrieved the overall query result. □ 
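The postponed theta join of Example 8 can be sketched over plain Python dictionaries. The values (a credit of 30 and the two product documents) are those shown in Figure 8; the `theta_join` helper and the list-of-dicts representation of the intermediate results are assumptions made for illustration only, not the framework's instance categories.

```python
# Intermediate result of the relational part (credit of the customer Mary).
relational_part = [{"Firstname": "Mary", "Credit": 30}]

# Intermediate result of the document part (books and toys, no ProductId).
document_part = [
    {"Kind": "Book", "Name": "Fairy Tales", "Price": 20},
    {"Kind": "Toy", "Name": "Toy Car", "Price": 35},
]


def theta_join(left, right, predicate):
    """Pair every left row with every right row satisfying the predicate."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


# Postponed join condition evaluated at the unified layer: credit > price.
joined = theta_join(
    relational_part,
    document_part,
    lambda customer, product: customer["Credit"] > product["Price"],
)

# Keep only the product properties, as in the final table of Figure 8.
final = [{key: row[key] for key in ("Kind", "Name", "Price")} for row in joined]
print(final)
```

Only the book satisfies the condition, since 30 > 20 holds while 30 > 35 does not.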


Evaluation and optimization of queries are crucial aspects 
defining the practical usability of database systems. Thus, 
multiple heuristically designed query evaluation plans need to be 
considered so that the one with the lowest cost can be selected 
and executed. Apparently, the efficient evaluation also needs to be 
supported by indices, in this case model-agnostic ones [15]. 


Figure 8: Intermediate and final query results 
[Shown: the relational result (Credit = 30), the document result 
({ Kind: Book, Name: Fairy Tales, Price: 20 }, { Kind: Toy, 
Name: Toy Car, Price: 35 }), and the final result of the join on 
credit > price (Kind: Book, Name: Fairy Tales, Price: 20).] 


EXAMPLE 9. An alternative (though probably less efficient) query plan 
depicted in Figure 9 involves, in addition, also the graph model (blue), where 
names of customers can also be extracted. □ 


Figure 9: Alternative query evaluation plan 


3.7 Evolution Management 


Evolution management is a process of preserving the integrity of 
the whole system when user requirements change. A change in the 
structure of the data may require changes in related data instances, 
queries, integrity constraints, storage strategies, etc. Thanks to the 
general representation of all the data models using the schema 
category, we can define a unified cross-model set of schema modifi- 
cation operators (SMOs). The abstract categorical layer also enables 
backward compatibility of the queries even when the storage strate- 
gies and related mappings may change. 


EXAMPLE 10. An example of evolution might involve merging of attributes 
Firstname and Lastname into one attribute Fullname. In addition, the data 
about customers can be migrated from the graph model to the relational model 
only. In Figure 10, we can see the old and new versions of the affected parts 
of the schema category. While the query accessing the modified part must be 
adapted accordingly in the first case, the schema category and so the categorical 
queries remain the same in the second case. □ 
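The first change in Example 10, merging Firstname and Lastname into Fullname, can be sketched as a schema modification operator applied to data instances. The `merge_attributes` function, its signature, and the sample customer (including the last name) are hypothetical; the sketch only illustrates the kind of data migration such an SMO implies.

```python
def merge_attributes(records, sources, target, separator=" "):
    """Replace the `sources` attributes of each record by one `target` attribute."""
    merged = []
    for record in records:
        # Copy everything except the attributes being merged.
        updated = {key: value for key, value in record.items() if key not in sources}
        # Concatenate the source values in the given order.
        updated[target] = separator.join(str(record[name]) for name in sources)
        merged.append(updated)
    return merged


# Hypothetical customer instance; the attribute names follow Figure 10.
customers = [
    {"CustomerId": 1, "Firstname": "Mary", "Lastname": "Smith", "Credit": 30},
]

migrated = merge_attributes(customers, ("Firstname", "Lastname"), "Fullname")
print(migrated)
```

A query that previously accessed Firstname or Lastname would have to be adapted to Fullname, which is the first case discussed in the example.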


The existing multi-model evolution management approaches 
provide an interesting inspiration, but they also have significant 
limitations: for example, they focus only on aggregate-oriented 
NoSQL systems [7] or omit critical inter-model links [18]. 
A sufficiently general set of SMOs can most likely be borrowed 
from [25]. 


Figure 10: Old and new versions of schema category 


4 SUMMARY AND CONCLUSION 


The outlined framework has many features that seem promising 
and worth further exploration. It can be deployed in both the 
polystore and multi-model scenarios, i.e., it is applicable to 
state-of-the-art systems. While we believe it can become a basis 
of newly designed database management systems, it still relies on, 
integrates, or extends various existing verified approaches, and it 
further elaborates concepts already advocating such objectives [15]. 

In other words, we propose to go beyond the ideas and bound- 
aries of the contemporary systems toward entirely unified and 
conceptual databases that would allow us to fully abandon the idea 
of various logical models together with their distinct data struc- 
tures, specific features, and often proprietary query languages, as it 
is demonstrated in Figure 11. Though such systems could be based 
on category theory, as we have outlined, other well-established 
formalisms could possibly serve as well. 

It is apparent that there are too many existing systems, 
models, formats, and languages. If it is difficult for users to get 
sufficiently acquainted with such approaches and to decide 
whether and when to use them in single-model situations, the task 
becomes even more challenging when polyglot persistence is 
followed, i.e., when multiple models and systems should be considered, 
understood, deployed, and maintained at the same time. At least 
from the long-term perspective, it is highly unlikely that such a 
variety, together with the currently ongoing trends, could remain 
sustainable. 


Figure 11: Layered architecture of database systems 
[Columns: single-model, polystore, multi-model, and unified scenarios; 
rows: conceptual, logical, and physical layers.] 


Besides the already mentioned key aspects of the envisioned 
self-tunable database management system, we think the emphasis 
should be put on the formal background, unification, conceptuality, 
and user-friendliness. In other words, it is necessary to find a 
reasonable balance between the utilization of high-level formal 
theories, such as the one assumed, and the practical applicability 
of the whole approach. The core differences, features, advantages, 
and contributions of the envisioned framework are summarized as 
follows. 


• Intended schema of the data is described only once and at 
the conceptual platform-independent layer, only targeting 
real-world entities and relationships they can enter into. 

• Schema categories have higher expressive power than the 
traditional ER or UML languages, seamlessly allowing work 
with structured attributes or other constructs. 

• Attributes, entity types, and relationship types no longer 
need to be mutually distinguished and handled differently, 
and so the entire modeling process can be simplified. 

• Instance categories are general enough to serve as a universal 
data structure for all the currently widely used models, 
permitting the integration of future suitable models, too. 

• The involved logical models play mutually equal roles; none 
of them is given priority, unlike in many existing 
multi-model systems. 

• The traditional conceptual and logical layers collapsed into 
just a single unified representation, so they no longer need 
to be considered separately. 

• Handling of the data variety aspect is pushed from the former 
logical layer to the physical one, where various file 
organizations and indices can appropriately be utilized. 

• Decomposition into different models became internal with 
no impact on the users, and so they do not need to be aware 
of it, nor are they forced to adjust the style of thinking with 
respect to the particular models involved. 

• Individual components can intentionally overlap each other 
and do not necessarily need to contain all the data, so 
that long-running migrations can be supported. 

• Though the users can supervise the decomposition process, 
it can be fully autonomous and capable of reorganizing the 
data based on the changing workload or other aspects. 

• Query language and its constructs are conceptual, and so 
independent of the internal representation of the data; 
users do not need deeper knowledge of it. 

• Querying itself is based on sub-graph pattern matching, a 
mature enough principle from graph databases allowing for 
the evaluation of complex queries. 

• The query model is closed with respect to the input data 
and intermediate/overall results, all consistently modeled in 
terms of instance categories. 

• Schemas, data, as well as queries are all modeled via 
homogeneously designed categories, all following the same structure 
and principles. 

• Inter-model references or links of other kinds and cross-model 
queries are supported natively, which is not the case 
of many an existing system. 

• Category theory provides a strong formal background, which 
brings the potential for advanced query optimization 
techniques that would otherwise be difficult to achieve. 


Although we covered the key components concerning schema 
modeling, data representation, and querying, there are also other im- 
portant aspects of multi-model data management we did not cover. 
We also did not want to provide a complex solution but to outline 
promising research directions and encourage both researchers and 
practitioners to pursue the envisioned ideas. 


ABSTRACT 


The present work describes the analysis conducted on the 
diagnoses made during the general physical examinations in 
the decade 2010-2020, starting from the DB of the EMR 
previously implemented in the University Veterinary Teaching 
Hospital at Federico II University of Naples. A decision tree 
algorithm was implemented to work out a predictive model 
for an effective recognition of neoplastic diseases and 
zoonoses for cats and dogs from Campania Region. The results 
achievable by data mining techniques concerning 
computer-aided disease diagnosis and the exploration of risk 
factors and their relations to diseases show the increasing 
importance of Veterinary Informatics within the wider field of 
Biomedical and Health Informatics, and in particular its 
capacity to point out the existing connections between 
humans, animals, and the surrounding environment, according to 
the One (Digital) Health perspective. 


1 Introduction 


Animals, be they categorized as pets, livestock, or wildlife, 
are an essential element in the evolution of the human race for 
countless reasons. In particular, animal healthcare-related 
aspects play a prominent role because of their strict 
connections with human health. Monitoring the health state of 
both wildlife and synanthropic species can in fact provide 
valuable information about (i) the quality of the environment 
they live in and share with humans, in terms of 
pollution level as well as food safety and traceability 
management; (ii) the occurrence of zoonotic phenomena (for 
instance, leptospirosis and the recent COVID-19 pandemic). 
Furthermore, many non-infectious diseases (e.g. diabetes, 
cancer, and renal failure) are similar in both animals and 
humans [1]. Consequently, the need for an effective tracking 
of veterinary information, to facilitate the integration of animal 
medical data in support of Public Health, has become essential. 
As a matter of fact, from the epidemiological perspective the 
advantages of using animals as sentinels or comparative 
models of human diseases are well known, as animal sentinels 
may be sensitive indicators of environmental hazards and 
provide an early warning system for public health 
interventions [2]. With specific reference to 
Campania Region, studies of this kind are of particular 
concern due to the widely known “Terra dei 
Fuochi/Land of Fires” phenomenon (see e.g. [3,4]). An equally 
important aspect relates to the control of zoonoses, 
i.e. those diseases that can be transmitted from animals to 
human beings via faeces, urine, saliva, or blood. This is the 
case of e.g. intestinal parasites and ticks (that use the animal 
as a vector), or rabies (transmitted via the saliva). Such risks 
have to be carefully taken into account when it comes to the 
cohabitation between humans and (conventional as well as 
non-conventional) pets [5]. The implementation of integrated 
veterinary information management systems (VIMS) for the 
capture, storage, analysis and retrieval of data provides the 
opportunity for the cumulative gathering of knowledge, 
and the capability for its competent interpretation [6]. To this 
end, it becomes useful to resort to data mining computational 
methods for extracting knowledge from large animal 
databases as well. Among the most widespread data mining 
algorithms [7,8], the decision tree provides a tree-based 
classification for developing a predictive model according to 
independent variables [9]. 

This paper presents the main results from the analysis 
of the data extracted from the PONGO software ©, i.e. the first 
EMR solution implemented in the University Veterinary 
Teaching Hospital (it: OVUD, acronym for Ospedale 
Veterinario Universitario Didattico) of the “Federico II” 
University of Naples, Italy. The main goal was to establish, by 
means of a decision tree algorithm, a predictive model for an 
effective recognition of neoplastic diseases and zoonoses 
using clinical data, according to clinical, para-clinical, and 
demographic attributes. The investigation of the quality of 
clinical data of OVUD’s patients is intended to help, at least 
at a region-wide scale, to find specific connections between 
people's health, animal health, and their 
surrounding environment, thus conveying the specific Public 
Health dimension into the greater One (Digital) Health 
scenario [10]. 


2 Materials and Methods 


2.1 Subjects 


The data extracted from the PONGO software in the form of an 
MS Access DB relate to the general physical examination (GPE), 
that is the first visit performed by the veterinarian when the 
animal arrives at the hospital. The database contains about 10360 
rows (one row per animal access) which span a period 
from 2010 to mid-2020. The visits were mainly 
performed on pets, i.e. dogs (n = 8925; 86%) and cats (n = 
1181; 11%). Horses were treated in the hospital as 
well (n = 160; 2%). Only a small part of the animals examined 
(n = 92; 1%) belonged to other species (ducks, donkeys, 
bovines, buffaloes, goats, lagomorphs, rodents, tortoises, and 
birds). Besides animal species and date of the visit, the main 
fields of the DB also related to age and sex of the animal, main 
health issue (HI) acknowledged during the GPE, type of 
feeding (e.g. commercial vs. homemade), and vaccination 
status information. Also considered in the study were the kind 
of environment the animal used to live in (e.g. in an 
apartment, or outdoors), and the Italian province it came 
from. As for the latter point, the research was limited to the 
provinces of Campania Region, due to the marginal number of 
rows related to patients coming from other Italian regions. 
Table 1 reports the accesses to OVUD, based on the 
geographic provenance, for dogs, cats, and horses. Several 
OLAP operations [11-14] were performed on the mentioned 
dataset, in order to investigate the quality of clinical data of 
OVUD’s patients for the considered time period. 

Given the small number of accesses for the other species, it was 
decided to focus the investigation only on dogs and cats. 


Table 1: Distribution of dogs, cats, and horses that 
accessed the OVUD, according to the Italian provinces 


Species | Province | # | % 
Dog | Avellino | 145 | 2% 
Dog | Benevento | 71 | 1% 
Dog | Caserta | 753 | 8% 
Dog | Napoli | 6967 | 78% 
Dog | Salerno | 444 | 5% 
Dog | Other Italian provinces | 545 | 6% 
Cat | Avellino | 22 | 2% 
Cat | Benevento | 9 | 1% 
Cat | Caserta | 68 | 6% 
Cat | Napoli | 975 | 83% 
Cat | Salerno | 41 | 3% 
Cat | Other Italian provinces | 66 | 6% 
Horse | Avellino | 3 | 2% 
Horse | Benevento | [illegible] | 4% 
Horse | Caserta | 14 | 9% 
Horse | Napoli | 64 | 40% 
Horse | Salerno | 49 | 31% 


2.2 Accesses per animal sex 


Four types of sex specifications have to be considered for 
animals: male (M), castrated male (MC), female (F), spayed 
female (FS). Figure 1 reports the accesses to the OVUD of dogs 
and cats, respectively, for the time period considered. The 
number of rows/visits for which it was not possible to 
retrieve the sex of the animal is also reported. Only in one 
case was the animal (a dog) reported as not visited after the 
access to the hospital. It has to be pointed out that the lower 
number of accesses registered in 2016 in both cases was due 
to a partial stop of the OVUD activities, as at the end of 2015 
a structural collapse affected part of the University 
Department that hosts the hospital itself. The number of male 
dogs’ accesses is about twice the number of female accesses in 
almost all the years considered, with considerably lower 
numbers for the neutered dogs. The situation is different for 
cats, where the differences M/MC and F/FS tend to be 
proportionally smaller, sometimes in favour of the neutered 
animals. 


2.3 Health Issues per year 


It was possible to identify about 140 different diagnoses from 
the GPE for the period considered. For space reasons, it 
was decided to investigate, for both dogs and cats, only the 
three most relevant health issues (HIs) for each year, as 
reported in Tables 2 and 3. HIs featuring the same number of 
occurrences were all considered. The only exception is 
for cats’ HIs in 2012, where the occurrences for HI #3 were 
equal to 1 for a very large set of issues, so it was decided not 
to report it in the table. 


Figure 1: Accesses to OVUD of dogs (up) and cats (down) 


Table 2: The three most diagnosed health issues for dogs 


Year | HI #1 | HI #2 | HI #3 
2010 | Limping | Injury of abdomen | Firm lymph node (on exam.) 
2011 | Limping | On exam. - inspection of vomit | Pain in [illegible]; Skin lesion (on exam.); Cough 
2012 | Alopecia | Skin lesion (on exam.); Limping | On exam. - inspection of vomit 
2013 | Limping | Alopecia | On exam. - inspection of vomit 
2014 | Limping | Neoplastic disease | Alopecia 
2015 | Limping | Neoplastic disease | Alopecia 
2016 | Limping | Injury of abdomen; Alopecia | [illegible] (on exam.) 
2017 | Limping | Firm lymph node (on exam.); Alopecia | Neoplastic disease 
2018 | Limping | Injury of abdomen | Neoplastic disease 
2019 | Injury of abdomen | Neoplastic disease | Cough; Limping 
2020 | Injury of abdomen | Alopecia | On exam. - inspection of vomit 


Table 3: The three most diagnosed health issues for cats 


Year | HI #1 | HI #2 | HI #3 
2010 | On exam. - inspection of vomit; Pain in eye | Injury of abdomen | Urinary tract pain 
2011 | On exam. - inspection of vomit | Alopecia | Urinary tract pain 
2012 | On exam. - inspection of vomit | Firm lymph node (on exam.); Alopecia; Pain in eye | (not reported) 
2013 | On exam. - inspection of vomit | Pain in eye; Injury of abdomen | Alopecia 
2014 | Pain in eye | Injury of abdomen | Urinary tract pain 
2015 | On exam. - inspection of vomit | Neoplastic disease | Alopecia; Skin lesion (on exam.); Pain in eye 
2016 | On exam. - inspection of vomit | [illegible] | Alopecia 
2017 | Injury of abdomen | Skin lesion (on exam.) | On exam. - inspection of vomit; Urinary tract pain; Neoplastic disease 
2018 | On exam. - inspection of vomit | Alopecia; Pain in eye; Injury of abdomen | Skin lesion (on exam.) 
2019 | Injury of abdomen | Cough | Pain in eye; Alopecia 
2020 | Injury of abdomen | Firm lymph node (on exam.) | Skin lesion (on exam.); Closed fracture of hip; Sore mouth 


The occurrences of such HIs during the years are reported in 
Tables 4 and 5, and in Figure 2. 


Figure 2: Distribution of the three most diagnosed health 
issues per year, for dogs (up) and cats (down) 


Table 4: The three most diagnosed health issues for dogs 
[table values not recoverable] 


Table 5: The three most diagnosed health issues for cats 
[table values not recoverable] 


The total number of occurrences is depicted in Figure 3. In 
both cases, it is worth noticing the presence of neoplastic 
diseases (dogs: N = 263; cats: N = 9) and firm lymph 
node-related (dogs: N = 95; cats: N = 3) diagnoses. Moreover, 
considering the animals’ year of birth (spanning from 1984 to 
2020), it was possible to compare for each trimester the 
diagnoses of firm lymph nodes and neoplastic diseases. This 
revealed that in 44% of the cases for dogs of the same age, and 
in 5% of the cases for cats of the same age, the number of 
occurrences of firm lymph node-related diagnoses was greater 
than or at least equal to that of neoplastic disease diagnoses, 
thus suggesting, at least for dogs, the reasonable hypothesis of 
an existing connection between the two pathologies. 

Furthermore, Figure 3 reports the occurrences of those 
diagnoses which can be somehow related to the transmission 
of zoonoses, from tetanus (N = 1 for dogs) to vomit (dogs: N = 
254; cats: N = 53). Such diagnoses account for 7% of the total 
occurrences registered in OVUD for dogs and cats for the 
period considered. 


Figure 3: Total occurrences for the three most diagnosed 
health issues, for dogs (up) and cats (down) 


2.4 Dataset 


A preliminary step of dataset cleansing was necessary, 
especially concerning the health diagnoses, as no form 
of standardized clinical terminology had been deployed. 
Moreover, for about 30% of the rows (N = 3729), this type of 
data was missing, and only in a limited number of cases was it 
possible to recover it by analysing the remaining fields of the 
database. Eventually, the total number of participants 
considered in the model was 10108. Health Issues (as already 
reported in Tables 2 and 3) were 
categorized according to the vet-SNOMED terminology 
[15,16], as developed by the Veterinary Medical Informatics 
Laboratory at Virginia-Maryland College of Veterinary 
Medicine: for each of them, the corresponding Concept was 
identified, together with the related SNOMED hierarchy level, 
the Concept ID, and the Preferred Description (Synonym) ID 
[17,18]. Given the mentioned importance of identifying the 
presence of neoplastic disease-related and/or 
zoonosis-related diagnoses, the need emerged to find a way to 
predict the presence of symptoms for both issues, for both 
dogs and cats, which also happen to live 
very close to humans. In particular, according to what is 
depicted in Figure 4, for zoonoses it was 
decided to consider for the analysis the diagnosis of 
“inspection of vomit”. 


IDEAS 2021: the 25th anniversary 


IDEAS 2021, June 14-16, 2021, Montreal, QC, Canada 


Figure 4: Total occurrences of zoonosis-related 
diagnoses, for both dogs and cats 


Table 6 reports the vet-SNOMED classification for the two 
health issues considered in the study. 


Table 6: Veterinary SNOMED classification of the 
investigated diseases 


Type of Disease | Neoplastic Disease | On examination - inspection of vomit 
Classification (Is a) | Disorder | Clinical finding 
ConceptID (according to VTSL) | 55342001 | 308637007 
Preferred DescriptionID (according to VTSL and AAHA) | 92007014 | 451951011 
Acceptable synonym | Neoplasia | O/E - inspection of vomit 
Synonym DescriptionID | 1231453017 | 2670302013 

2.5 DT ID3 Feature selection algorithm 


The implementation of a Decision Tree (DT) algorithm 
appeared to be the most suitable way to investigate the 
membership of the subjects in different categories (diagnosed 
with neoplastic disease, or not; diagnosed with vomit, or not), 
taking into account the values of specific attributes (predictor 
variables), which were identified for both cases as: 
animal sex, diet, vaccination, feeding routines, and living 
environment (plus the possible presence of a diagnosis of firm 
lymph nodes, for neoplastic diseases). In order to achieve 
these goals, a filter-based strategy using DT ID3 (Iterative 
Dichotomiser 3) was proposed [19]. As is common in data 
mining methods, the original sample was split into a training 
set (to train the model) and a test set (to evaluate the 
performance of DT ID3). In particular, the original Training 
dataset for the DT (oTrDS) featured all the accesses of dogs 
and cats to the OVUD between 2010 and 2018 (N = 8643; 86%), 
while the original Testing dataset (oTeDS) comprised the 
remaining accesses between 2019 and 2020 (N = 1465; 14%). The 
common rule according to which oTrDS ~ 70% of the sampling 
data and oTeDS ~ the remaining 30% was not respected, mainly 
because of two factors: (i) the reduced accesses 
to OVUD in 2016 due to the mentioned structural collapse; 
and (ii) the available data from year 2020 only cover the first 
six months. Since the aim of the study was to make predictions 
for two kinds of health issues, each for two animal species, four 
specific Training datasets (sTrDS) and four specific Testing 
datasets (sTeDS) were extracted from oTrDS and oTeDS, 
respectively. For each case, a confusion matrix was used to 
evaluate the performance of the DT in classifying 
participants. Accuracy, sensitivity, and specificity were then 
measured for comparison. For the sake of simplicity, the 
decision tree and confusion matrix are presented in the 
following for one case only (presence of symptoms for 
neoplastic disease in dogs). A comparison was instead 
conducted for the performances of all four algorithms. 
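The year-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the `split_by_year` and `share` helpers and the `year` field are hypothetical names, and the toy records stand in for the actual PONGO rows; only the dataset sizes (8643 and 1465 out of 10108) come from the text.

```python
def split_by_year(records, last_training_year=2018):
    """Split accesses into training (up to 2018) and testing (2019 onwards)."""
    training = [r for r in records if r["year"] <= last_training_year]
    testing = [r for r in records if r["year"] > last_training_year]
    return training, testing


def share(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


# Toy records standing in for the accesses to the hospital.
records = [{"year": 2017}, {"year": 2018}, {"year": 2019}, {"year": 2020}]
training, testing = split_by_year(records)
print(len(training), len(testing))

# Shares of oTrDS and oTeDS for the sizes reported in the text.
print(share(8643, 10108), share(1465, 10108))
```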


3 Results 


A decision tree was built starting from the sTrDS related to 
the recognition of neoplastic disease for dogs (N = 8927). The 
sTeDS (N = 1305) was used to evaluate the model. The input 
variables were animal sex, diet, vaccination, feeding routines, 
living environment, and the possible presence of a diagnosis of 
firm lymph nodes. As seen, since for dogs the possibility of a 
correlation between the diagnoses of neoplastic disease and 
firm lymph nodes was recognized, the number of 
subjects positive for both health issues (ND+ and L+) was 
reported in the algorithm. ID3 uses two metrics to measure 
the importance of the input variables, or features: 
entropy (the measure of the amount of uncertainty) and 
information gain (the difference between the entropy of the 
DS and the one related to the single feature). Let DS be a given 
dataset, and X the set of variables in DS. For each x ∈ X, the 
lower the entropy, the higher the information gain. At each 
iteration, the algorithm selects the feature with the smallest 
entropy/largest information gain value. The final decision tree, 
with size 15, 8 leaves and 5 layers, is shown in Figure 5. 


Tamburis et al. 


Figure 5: Decision Tree to evaluate the presence of 
symptoms for neoplastic disease in dogs 
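The two ID3 criteria described above, entropy and information gain, can be sketched in a few lines. The toy rows, the feature names `firm_lymph_node` and `sex`, and the `best_feature` helper are illustrative assumptions and do not reproduce the dataset or the tree of Figure 5.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus its weighted entropy after splitting on feature."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_feature(rows, features, target):
    """Feature ID3 would pick at this step: the one with the largest gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))


# Toy data: the lymph node finding separates the classes, the sex does not.
rows = [
    {"firm_lymph_node": "yes", "sex": "M", "neoplastic": "ND+"},
    {"firm_lymph_node": "yes", "sex": "F", "neoplastic": "ND+"},
    {"firm_lymph_node": "no", "sex": "M", "neoplastic": "ND-"},
    {"firm_lymph_node": "no", "sex": "F", "neoplastic": "ND-"},
]

chosen = best_feature(rows, ["firm_lymph_node", "sex"], "neoplastic")
print(chosen)
```

In this toy case the split on the lymph node finding leaves no uncertainty in either subset, so it has the largest information gain and would be selected first.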


The evaluation of the tree was undertaken using a confusion 
matrix on the testing dataset, as shown in Table 7. The algorithm 
had an Accuracy of 99%: of the 70 animals diagnosed as ND+ 
in the sTeDS, 60 were correctly classified using the DT. In a 
subordinate position, of the 50 animals diagnosed as L+, 37 
were correctly identified. The specificity and sensitivity of the 
tree were equal to 99,2% and 1, respectively. 


Table 7: Confusion Matrix of sTeDS related to the 
recognition of neoplastic disease for dogs 


Expected Outcome | Predicted ND+ | Predicted ND- 
ND+ | 60 (TP) | 10 (FP) 
ND- | 0 (FN) | 1235 (TN) 
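The reported accuracy, sensitivity, and specificity follow from the four cells of Table 7 through the standard definitions. The sketch below takes the cell labels (TP, FP, FN, TN) exactly as printed in the table; the `classification_metrics` helper is a hypothetical name introduced for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, and specificity from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Cell values and labels as printed in Table 7.
metrics = classification_metrics(tp=60, fp=10, fn=0, tn=1235)

for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

With these counts the accuracy is 1295/1305, the sensitivity 60/60, and the specificity 1235/1245, which round to the values given in the text.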


The performance of the DT is also reported in Table 8. An 
overall comparison between the performances of the algorithm 
for the four cases investigated is reported in Table 9. 


Table 8: Performance of the DT ID3 model for the case 
investigated 


Variable | Decision Tree Model 
Sensitivity (95% CI) | 1 (93,9 - 1) 
Specificity (95% CI) | 99,2 (98,5 - 99,6) 
Accuracy (95% CI) | 99 (98,3 - 99,4) 


Table 9: Comparison of the performances of the DT ID3 
algorithm for all the cases investigated 


Species | Health issue | Accuracy | Sensitivity | Specificity 
Dogs | Neoplastic Disease | 99% | 100% | 99,2% 
Dogs | Zoonosis (Vomit) | 100% | 100% | 100% 
Cats | Neoplastic Disease | 100% | 100% | 100% 
Cats | Zoonosis (Vomit) | 100% | 100% | 100% 


Although the number of cat-related diagnoses extracted 
from the PONGO DB was significantly lower than that of the 
dog-related ones, the overall results obtained nonetheless 
confirmed the validity of the data mining algorithm implemented, 
which turned out to be highly capable of modelling the process of 
healthcare provision [20], as well as of setting forth reliable 
measurements of system performance and outcomes [21]. 


4 Discussion and Conclusions 


In this paper a decision tree algorithm was implemented, 
starting from the database of the University Veterinary 
Hospital of the Federico II University of Naples, to work out a 
predictive model for an effective recognition of neoplastic 
diseases and zoonoses using clinical data, according to clinical, 
para-clinical, and demographic attributes. The main aim was 
to investigate whether and to what extent relations exist 
between human and animal health and their surrounding 
environments. The whole set of disciplines broadly dealing 
with this kind of “connecting chain” goes under the name 
of One Health (OH), introduced for the first time as part of the 
twelve “Manhattan Principles” calling for an international, 
interdisciplinary approach to prevent diseases [22], 
specifically animal-human transmissible and communicable 
ones. The rising interest in the manifold competence 
areas of OH has led to the development of new disciplines 
such as digital epidemiology and public health informatics 
[23], with the purpose of improving the understanding of health 
risks and the effectiveness of management and policy decisions [24]. 
In this spirit, the seminal idea of “One Health Informatics” 
(OHI) was proposed as connected to the deployment of big 
data analytics to support and improve public health and 
medical research and address issues related to e.g. 
biodiversity control, disease monitoring, or control of 
zoonoses [25,26]. This was especially possible thanks to the 
ongoing remarkable progress in, and the evolution of, the field 
of health and biomedical informatics (for humans), which is 
making increasingly manifest how health information 
technology (HIT) improves health, health care, public health, 
and biomedical research [27]. The same level of attention 
has been growing for veterinary disciplines as well, under the 
name of “veterinary (medical) informatics”: in recent years, in 
fact, a significant development has also been witnessed in 
animal health concerning the information 
infrastructure that supports the knowledge translation 
processes of exchange, synthesis, dissemination, and 
application of the best clinical intervention research. In 
particular, the evolution of timely Electronic Medical Record 
solutions allows achieving the same benefits as in 
human health, in terms of increased quality of care, better 
coordination among vet professionals, efficient care path 
design, etc. EMRs for pets (dogs and cats, 
mainly) share structure and objectives with human-focused 
products. Furthermore, new generations of vet-EMRs also allow 
the recording of data for single animals from a herd, 
thanks to more functional interfaces with animals’ electronic 
identification tools. 

Seen from this comprehensive point of view, the surge of
dynamics connected to emerging and re-emerging infectious
diseases from national to supranational contexts, as well as
the need to identify at a global level the risk factors and
causes of health problems that arise at the
human-animal-environment interface, have made the role of
veterinarians in protecting human health even more
remarkable. This points to the growth of veterinary
informatics, which also encompasses the need for new
paradigms, approaches and technologies to reinforce the
capacity of traditional surveillance systems for the prevention
and control of zoonoses, in terms of, e.g., inter-sectoral
coordination, links between human and animal health data
and the consequent management of flows of reliable data and
information, or proper use of infrastructures, systems and
human resources to detect outbreaks [28].
ABSTRACT


Quantum computers are known to be efficient for solving com- 
binatorial problems like finding optimal schedules for processing 
transactions in parallel without blocking. We show how Grover’s 
search algorithm for quantum computers can be applied for finding 
an optimal transaction schedule via generating code from the prob- 
lem instance. We compare our approach with existing approaches 
for traditional computers and quantum annealers in terms of prepro- 
cessing, runtime, space and code length complexity. Furthermore, 
we show by experiments the expected number of optimal solu- 
tions of this problem as well as suboptimal ones. With the help of 
an estimator of the number of solutions, we further speed up our 
optimizer for optimal and suboptimal transaction schedules. 
1 INTRODUCTION


For the universal quantum computer, Grover [14] developed an
algorithm for finding (with high probability) the input to a black
box function that yields a particular result. In comparison to
traditional computing, Grover’s search algorithm achieves a
quadratic speedup, which has been shown to be optimal among all
possible quantum algorithms [3, 14].

Grover’s search often serves as a basis for algorithms solving
combinatorial optimization problems [2] by
- encoding candidate solutions as integer numbers, which are the
  input of the black box function of Grover’s search, and
- using, as the black box function for Grover’s search, a function
  that checks whether a candidate solution has costs below a given
  threshold.

We analyze this pattern of solving combinatorial optimization
problems on quantum computers using the optimizing transaction
schedules problem [5, 6]. This problem is a variant of the job
shop scheduling problem (JSSP), where jobs are assigned to
machines (i.e., cores of a multi-core CPU in the domain of parallel
computing) with the optimization goal of minimizing the overall
processing time. For the optimizing transaction schedules problem,
the jobs are transactions, which might block each other because
they access at least one common data object; conflicting
transactions are scheduled so that they do not run in parallel,
which avoids blocking of these transactions.

We propose to optimize the transaction schedule of batches by 
utilizing quantum computers as hardware accelerators [11-13]. By 
using pipelining, we can already optimize the transaction schedule
of the next batch during the processing of the current batch of
transactions (see Figure 1). In this way, the quantum computer is 
run in parallel to the transaction processing on traditional CPUs in 
order to minimize waiting times and to maximize throughput. 

Grover’s search needs to call its black box function with candi- 
date solutions in superposition. Only a limited set of operations can 
be applied to qubits in superposition, such that not all algorithms of 
traditional computing have corresponding counterparts in quantum 
computing. As a consequence for many combinatorial optimization 
problems and also for the optimizing transaction schedules problem, 
the code (i.e., the combination and orchestration of the quantum 
logic gates) of the black box function needs to be generated for the 
specific instance of the considered problem. We will show how to 
generate the code of the black box function for an instance of the 
optimizing transaction schedules problem. 

Whenever the black box function doesn’t need constant time, but 
its runtime depends on the instance of the considered problem, the 
overall runtime complexity of solving the considered combinatorial 
optimization problem is higher than the runtime complexity of 
the pure Grover’s search. Hence, we will analyze the runtime com- 
plexity of solving the optimizing transaction schedules problem. 
Additionally, we analyze the preprocessing time (needed for gener- 
ating the black box function), space requirements (i.e., number of 
qubits) and code length, and compare these results with the naive 
implementation on traditional computers (enumerating and testing 
all candidate solutions) and a recent solution [5, 6] on quantum
annealers (specialized quantum computers for solving quadratic
unconstrained binary optimization (QUBO) problems).

Figure 1: Pipelining the optimization of transaction schedules
and the processing of transaction batches: The batches
i + 2 to i + j of transactions are in the queue for optimizing
their schedule and processing, the schedule of batch i + 1
is currently optimized by the quantum computer, batch i is
currently processed and the results of batch i − 1 are being
transmitted to the clients.

We have implemented an optimizer for transaction schedules
that generates code for the quantum programming language and
simulator Silq [4]. The code generation is efficient, and tests of
the generated code in Silq are successful.

Our main contributions are
- a code generation approach for the black box function of Grover’s
  search running on universal quantum computers for optimizing
  transaction schedules,
- a complexity analysis concerning preprocessing and execution
  time, space and code length, and a comparison of our approach
  with the ones running on traditional computers and on quantum
  annealers, and
- an implementation for the quantum simulator Silq [4].
2 BASICS


In this section, we introduce quantum computing in subsection 2.1, 
Grover’s search algorithm in subsection 2.2 and transaction man- 
agement in subsection 2.3. 


2.1 Quantum computer 


Quantum computers are not based on classical mechanics; instead,
they exploit the effects of quantum mechanics.

Quantum mechanics describes the states and behavior of particles
at atomic and subatomic scales, which do not follow the laws of
classical physics. At this scale, effects occur that the quantum
computer makes use of, especially superposition and quantum
entanglement. The quantum computer uses qubits, which can be in
a superposition of two states. While a bit assumes the state 0 or 1,
a qubit can be in a superposition of the states 0 and 1. If the state
is measured, the qubit collapses to one of the two states, each of
which is observed with a certain probability. Quantum
entanglement couples qubits, such that entangled qubits mutually
influence their measurement probabilities. Viewing superposition
as a special form of parallel computing opens up new ways of
computation and, for some problems, allows in theory an
exponential speedup compared to classical computers.

Universal Quantum computing finds desired solutions of a 
problem by clever manipulation of single qubits as well as entan- 
gled qubits. For the purpose of manipulating qubits, gates pro- 
vide elementary operations on one or two qubits. For example, the 
Hadamard-Gate puts one qubit into superposition and a Controlled- 
NOT(CNOT)-Gate inverts a second qubit depending on the first [1]. 
A quantum computer is thus able to execute quantum algorithms 
like Shor’s algorithm for factorizing large numbers [19] or Grover’s 
algorithm for searching in huge unsorted databases [14, 15]. 


2.2 Grover’s search algorithm 


Grover [14] showed that his search algorithm has an expected
runtime of $\frac{\pi}{4} \cdot \sqrt{N}$ basic steps, where N is the
size of the search domain, and claimed, based on a result in [3],
that its optimality among all possible quantum algorithms can be
proven up to a multiplicative constant. Variants [7, 18] of Grover’s
search run on average in $u \cdot \sqrt{N/k}$ basic steps without
knowing the number k of solutions in advance. The variants differ
in being randomized [7] or deterministic [18], and in the constant
u, where [7] achieves $u = \frac{9}{4}$ and [18] $u = \frac{8\pi}{3}$.
Whenever the number k of solutions is approximately known in
advance, Grover’s search is able to find a solution in
$\frac{\pi}{4} \cdot \sqrt{N/k}$ basic steps, such that a speedup of
$\sqrt{k}$ can be achieved in comparison to the case where only one
solution is present, a speedup of 2.86 in comparison to [7] and
10.66 in comparison to [18].
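
As a rough numerical illustration of these step counts, the following Python sketch compares Grover's search with a known number k of solutions against the variants that do not know k. The function names are hypothetical, and the constants u = 9/4 for [7] and u = 8π/3 for [18] are assumptions chosen to be consistent with the quoted speedups of 2.86 and 10.66.

```python
import math

# Assumed constants u of the variants that do not know k in advance.
U_RANDOMIZED = 9 / 4               # assumption for the randomized variant [7]
U_DETERMINISTIC = 8 * math.pi / 3  # assumption for the deterministic variant [18]


def steps_known_k(N: int, k: int = 1) -> float:
    """Expected basic steps (pi/4) * sqrt(N/k) when k is (approximately) known."""
    return math.pi / 4 * math.sqrt(N / k)


def steps_unknown_k(N: int, k: int, u: float) -> float:
    """Average basic steps u * sqrt(N/k) of a variant that does not know k."""
    return u * math.sqrt(N / k)


if __name__ == "__main__":
    N, k = 2**20, 16
    known = steps_known_k(N, k)
    print("steps with known k:", known)
    print("gain over a single solution:", steps_known_k(N) / known)  # sqrt(k)
    print("gain over randomized variant:",
          steps_unknown_k(N, k, U_RANDOMIZED) / known)
    print("gain over deterministic variant:",
          steps_unknown_k(N, k, U_DETERMINISTIC) / known)
```

Under these assumptions the last two ratios do not depend on N or k: they are 9/π and 32/3, which is where figures of roughly 2.86 and 10.66 come from.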


2.3 Transaction Management 


A transaction $t = \langle s_1, \ldots, s_{|t|} \rangle$ of length |t| is a
series of operations $s_i$, carried out by a single user or
application program, which reads or updates the contents of the
database [9]. Each operation $s_i = a_i(e_i)$ consists of the type of
access $a_i \in \{r, w\}$, where r represents a read access and w a
write access, and the object $e_i$ to be accessed.

For example, the transaction
$t = \langle r(A), w(A), r(B), w(B) \rangle$ is of length |t| = 4.

Since transactions often need to access a database at the same 
time and thereby often reference the same objects, transactions in a 
database system are required to fulfill the so-called ACID properties 
[9]. The focus here is on the fulfillment of the isolation property,
which avoids the problems of unsynchronized parallel execution of
several transactions by requiring that concurrently executed
transactions do not influence each other.


2.3.1 Conflict Management. To ensure the isolation property,
conflicts between transactions must be dealt with. The simplest
strategy to deal with conflicts would be to execute the transactions
serially. Since this is too slow for operational systems and many
transactions do not conflict with each other at all, in practice
transactions are executed in parallel. There are various approaches
to dealing with the conflicts that arise; one of them is the use of
locks. For this purpose, each transaction acquires a lock on an
object before accessing it and releases the lock after the access,
thus preventing concurrent access to the object.

The two-phase-locking protocol (see Figure 2) is a locking
protocol that requires each transaction to consist of two
subsequent phases, the locking phase and the release phase. During
the locking phase, the transaction may acquire locks but not
release them, whereas during the release phase it releases
previously acquired locks. Once a transaction has released a lock,
it may not acquire any new ones. There is a distinction between
read locks (also called shared locks) and write locks (also called
exclusive locks). Variants of the protocol include the conservative
two-phase-locking protocol and the strict two-phase-locking
protocol. The conservative variant (preclaiming) requires that all
locks that are required during a transaction are acquired before
the transaction is started.¹ The strict variant holds all locks until
the end of a transaction, which avoids cascading aborts: these
occur in case of so-called dirty reads of objects written by
transactions that are later aborted, which forces the reading
transaction to abort as well. In this contribution, we consider the
combination of the conservative and the strict two-phase-locking
protocol in our transaction model, which results in a serial
execution of all transactions that access the same objects.


2.3.2 Conflicts. Let T be the set of transactions and D be the set
of data objects. Two transactions i ∈ T and j ∈ T are in conflict
with each other if there exist two operations of these transactions
that are in conflict with each other. Two operations $a_i(e) \in i$
and $a'_j(e) \in j$ of the transactions i and j are in conflict with
each other if they access the same object e ∈ D and at least one of
the operations writes the object:


$a_i(e)$ in conflict with $a'_j(e)$ if
$\exists\, i, j \in T,\ e \in D,\ a_i(e) \in i,\ a'_j(e) \in j:\ i \neq j \wedge (a_i = w \vee a'_j = w)$
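
The conflict definition above can be sketched in a few lines of Python. This is only an illustration under assumptions: representing an operation as an (access type, object) pair and the function names are hypothetical choices, not part of the paper.

```python
from typing import List, Tuple

Operation = Tuple[str, str]      # (access type 'r' or 'w', object name)
Transaction = List[Operation]


def operations_conflict(a: Operation, b: Operation) -> bool:
    """Two operations conflict if they access the same object and one writes."""
    return a[1] == b[1] and "w" in (a[0], b[0])


def transactions_conflict(t1: Transaction, t2: Transaction) -> bool:
    """Two different transactions conflict if any pair of their operations does."""
    return any(operations_conflict(a, b) for a in t1 for b in t2)


def conflict_pairs(transactions: List[Transaction]) -> List[Tuple[int, int]]:
    """All index pairs (i, j) with i < j of conflicting transactions."""
    return [
        (i, j)
        for i in range(len(transactions))
        for j in range(i + 1, len(transactions))
        if transactions_conflict(transactions[i], transactions[j])
    ]


if __name__ == "__main__":
    t0 = [("r", "A"), ("w", "A"), ("r", "B"), ("w", "B")]  # example from the text
    t1 = [("r", "A")]  # reads A, which t0 writes
    t2 = [("r", "C")]  # shares no object with the others
    print(conflict_pairs([t0, t1, t2]))
```

A list of index pairs like this corresponds to the kind of conflict input that a schedule optimizer needs, alongside the transaction lengths.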


¹ Please note that for those transactions for which the required
locks are not known before processing, the required locks can be
determined in an additional phase before transaction processing.
The contribution in [20] describes such an approach, which can
also be applied in our scenario.

Figure 2: a) Two-phase-locking protocol with locking phase
and release phase, and b) the strict conservative two-phase
locking protocol


We assume that accesses to the same object e ∈ D are serialized
and we use the notation $a_i(e) \rightarrow a'_j(e)$ to denote that the
operation $a_i(e)$ is executed before $a'_j(e)$. The isolation property
is therefore fulfilled if all conflicting operations of conflicting
transactions i ∈ T and j ∈ T (i ≠ j) are processed in the same order
of transactions:


$\big(\forall a_i(e) \in i,\ a'_j(e) \in j:\ (a_i = w \vee a'_j = w) \Rightarrow a_i(e) \rightarrow a'_j(e)\big)$
$\vee$
$\big(\forall a_i(e) \in i,\ a'_j(e) \in j:\ (a_i = w \vee a'_j = w) \Rightarrow a'_j(e) \rightarrow a_i(e)\big)$

The two-phase-locking protocol fulfills the isolation property
by guaranteeing that each object is continuously locked from its
first to its last access during the entire processing. An operation
$a_i(e)$ of a transaction i is suspended if another transaction j
(i ≠ j) already holds a lock on the object e (and the held lock or
the lock to be acquired is an exclusive lock). In this way it is
guaranteed that all conflicting operations of transaction j are
executed before those of transaction i. According to [20],
preclaiming of locks imposes no real limitations (the required
locks can be determined in an additional phase before transaction
processing), while it eliminates deadlocks, in which transactions
wait for each other to release locks.


2.3.3 Optimal Transaction Schedules without Blocking. Figure 3
shows an example set of transactions, which block each other as
shown in Figure 4. In our example, we want to run these
transactions on a multi-core CPU with three cores (see Figure 5).
In optimal solutions, transactions blocking each other do not run
in parallel. We present one optimal solution in Figure 6. There are
many more optimal solutions: for example, by just exchanging the
cores (e.g., core 2 runs the transactions of core 3 in Figure 6 and
vice versa), we already get 3! = 6 equivalent transaction schedules.
The existence of many optimal solutions speeds up Grover’s
search algorithm.


Figure 3: n = 8 transactions with their respective lengths


Figure 4: Black fields indicate blocking transactions of the
transactions of Figure 3

Figure 5: Transactions are scheduled on three cores of a
multi-core CPU; the maximum runtime here is 10 time units

Figure 6: One optimal solution of a transaction schedule for
the transactions of Figure 3, which block each other as in
Figure 4, on three cores of a multi-core CPU without any
blocking with a runtime of 9 time units


3 OPTIMIZING TRANSACTION SCHEDULES 
BY UNIVERSAL QUANTUM COMPUTERS 


We propose the following approach for optimizing transaction 
schedules on universal quantum computers: The code for the black 
box function for Grover’s search is generated (see subsection 3.2) 
based on the transaction lengths and conflicts between the
transactions (see Figure 7). Because Grover’s search needs fewer iterations
for those problem instances with more than one solution, we esti- 
mate the number of solutions (see subsection 3.3) for our considered 
transactions in order to speed up our approach. The input of the 
black box function for Grover’s search is an integer number in su- 
perposition of a given number of qubits. In our approach, the input 
number in superposition represents possible transaction schedules 
and Grover’s search returns one of the integer numbers that
successfully passed the black box function. Hence we need a
mapping of a given integer number to a transaction schedule σ,
which is also called the encoding scheme for transaction schedules
(see subsection 3.1).

Figure 7: Overview of the code generation approach for
optimizing transaction schedules. (Diagram components: code
generation, estimator of the number of solutions, check function
mapping an encoded candidate to {True, False}, Grover’s search,
encoding scheme mapping the result to a transaction schedule σ,
and the transaction schedule σ listing the transactions of each
core from core 0 to core m − 1.)


3.1 Encoding Scheme for Transaction Schedules


We describe the representation of a transaction schedule, i.e., the
sequence of transactions for each machine (i.e., core of a multi-core
CPU), by an integer number and determine the number of used
bits.

We propose to split the representation of a transaction schedule
into two components: The first component represents a
permutation of the given transactions and the second component
determines where to split the permutation of transactions into m
subsequences to be scheduled to the m machines.

Algo determineSchedule
Input:  p: {0, ..., 2^b − 1} with b = ⌈log2(n!)⌉ + (m − 1) · ⌈log2(n)⌉
Output: {0, ..., n-1}^(m-1) × {0, ..., n-1}^n
for(x in 1..m-1)
  μ_x = p mod n
  p = p div n
a = [0, ..., n-1]
for(i in 0..n-1)
  j = p mod (n-i)
  p = p div (n-i)
  π[i] = a[j]
  a[j] = a[n-i-1]
return (μ_1, ..., μ_{m-1}, π)

Example (n=4, m=2):
p = 29
x=1: μ_1 = 1, p = 7
a = [0, 1, 2, 3]

i | j | p | π[i] | a[j] = a[n-i-1]
0 | 3 | 1 | 3    | a[3] = a[3]
1 | 1 | 0 | 1    | a[1] = a[2]
2 | 0 | 0 | 0    | a[0] = a[1]
3 | 0 | 0 | 2    | a[0] = a[0]

return (1, [3,1,0,2])

Figure 8: Algorithm for determining a transaction schedule
from an integer number with example


Because there are n! possibilities for a permutation π of n items,
our domain for representing the permutation of n transactions
is {0, ..., n! − 1}, for which we need $\lceil \log_2(n!) \rceil$ bits.
For decoding p ∈ {0, ..., n! − 1} to a permutation π of
transactions, we use a variant [8] of Lehmer’s code [16] with
runtime complexity O(n).

To represent the splitting of a permutation of the n transactions
into m subsequences to be scheduled to the m machines, we need
m − 1 markers $\mu_i$. We propose to use the domain {0, ..., n − 1}
for each of the markers and assume that at least one transaction
must be processed on the first machine. We require
$0 \le \mu_1 < \mu_2 < \ldots < \mu_{m-1} < n - 1$; otherwise the
candidate is not a valid transaction schedule and is discarded by
the generated black box function of Grover’s search. Defining
$\mu_0 := -1$ and $\mu_m := n - 1$, the scheduled transactions on
machine $m_x$ are
$\langle \pi(\mu_{m_x - 1} + 1), \ldots, \pi(\mu_{m_x}) \rangle$. The
function σ(x, i) to determine the i-th transaction on machine x is
hence defined as $\sigma(x, i) = \pi(\mu_{m_x - 1} + 1 + i)$, where
$x \in \{0, \ldots, m - 1\}$ and
$i \in \{0, \ldots, \mu_{m_x} - \mu_{m_x - 1} - 1\}$.

We present the complete algorithm for determining a transaction
schedule from an integer number, together with an example, in
Figure 8.

For each marker we need $\lceil \log_2(n) \rceil$ bits, such that we
overall need $\lceil \log_2(n!) \rceil + (m - 1) \cdot \lceil \log_2(n) \rceil$
bits for the whole representation of the transaction schedule.
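
The decoding step can be tried out on a traditional computer with the Python sketch below. It is a plain transcription of the algorithm of Figure 8 as we read it, not the generated quantum code. The function names are hypothetical, and the validity check on the markers is deliberately left out.

```python
from math import factorial
from typing import List, Tuple


def ceil_log2(x: int) -> int:
    """Smallest b with 2**b >= x, using exact integer arithmetic (x >= 1)."""
    return (x - 1).bit_length()


def num_bits(n: int, m: int) -> int:
    """ceil(log2(n!)) bits for the permutation plus (m-1)*ceil(log2(n)) for markers."""
    return ceil_log2(factorial(n)) + (m - 1) * ceil_log2(n)


def determine_schedule(p: int, n: int, m: int) -> Tuple[List[int], List[int]]:
    """Decode integer p into (markers mu_1..mu_{m-1}, permutation pi)."""
    markers = []
    for _ in range(m - 1):
        markers.append(p % n)
        p //= n
    a = list(range(n))
    pi = []
    for i in range(n):
        j = p % (n - i)
        p //= n - i
        pi.append(a[j])
        a[j] = a[n - i - 1]  # move the last unused item into the freed slot
    return markers, pi


if __name__ == "__main__":
    print(determine_schedule(29, n=4, m=2))   # the example of Figure 8
    print(num_bits(12, 2), num_bits(12, 4))   # bits for 12 transactions
```

For p = 29, n = 4 and m = 2 this reproduces the marker 1 and the permutation [3, 1, 0, 2] of the worked example.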


3.2 Code Generation Algorithm 


The black box function for Grover’s search requires qubits as input. 
Only a limited set of quantum computing operations, i.e., quan- 
tum logic gates, can be applied to these qubits. Furthermore, the 
sequence of quantum computing operations must be fixed in a 
circuit before the quantum computer starts its processing, i.e., at 
compile time, such that the quantum computer can be configured 
accordingly. This is the case for the algorithm for determining a 
transaction schedule (see Figure 8) whenever the number n of trans- 
actions and the number m of machines are fixed at compile time and 
do not depend on the input, such that, e.g., the contained loops can
already be unrolled at compile time. Hence the algorithm in Figure 8
can be used within the black box function of Grover’s search after 
changing the input and output types to qubits. Furthermore, due 
to the limited set of quantum computing operations qubits can e.g. 
neither be used as indices in arrays nor for indirect variable
accesses (like $$name in PHP), such that often a huge set of
variables holding values in superposition must be hard-coded.
Hence there is a need for generating the code of the black box
function for Grover’s search for the specific problem instance
(here the configuration of transactions including their lengths and
the conflicts between them).

Algo generateBlackBox
Input:  n: ℕ                    // number of transactions
        l: ℕ^n                  // lengths of transactions
        m: ℕ                    // number of machines
        c: ℕ                    // number of conflicts
        o: {0, ..., n-1}^(2×c)  // conflicts
        R: ℕ                    // maximum runtime on each machine
Output: code

"Algo blackBox"
"Input: t: ${⌈log2(n!)⌉ + (m − 1) × ⌈log2(n)⌉} qubits"
"Output: {True, False}"
"(μ_1, ..., μ_${m − 1}, π) = determineSchedule(t)"
"μ_$m = ${n − 1}"
// Is this schedule valid?
"result = ∀ i ∈ {1, ..., ${m − 1}}: μ_i < μ_{i+1}"
for(i in 0..n-1)
  "t_$i = switch(π[$i])" // determine length of transaction π[i]
  for(j in 0..n-1)
    "  case $j: l[$j]"
determineTimes(-1, 1, m, μ_1, ..., μ_m, {s | (s,d) ∈ o ∨ (d,s) ∈ o}, R)
for((o_1, o_2) ∈ o) // Are conflicting transactions overlapping?
  "if(not(t$o_1e < t$o_2s or t$o_2e < t$o_1s)) result = False"
"return result"

Algo determineTimes
Input:  q: {0, ..., n − 1}               // current transaction in π
        m_c: {0, ..., m − 1}             // currently considered machine
        m: ℕ                             // number of machines
        μ_1, ..., μ_m: {0, ..., n − 1}   // markers for machines
        os: set({0, ..., n − 1})         // set of transactions in conflicts
        R: ℕ                             // threshold of max runtime on each machine
Output: code

if(m_c = m or q = n)
  return
for(i in 0..n-1)
  "if(μ_$m_c = $i)"
  "  times = [0, t_${q + 1}, ..., t_${q + 1} + ... + t_$i]"
  // check runtime of this machine
  "  if(times[${i + 1}]>R) result = False"
  // determine start and end time of conflicting transactions
  for(j in q+1..i)
    for(o ∈ os)
      "  if(π[$j]==$o)"
      "    t$os = times[${j − 1}]"
      "    t$oe = times[$j]"
  determineTimes(i, m_c+1, m, μ_1, ..., μ_m, os, R)

Figure 9: Algorithm generateBlackBox and its called algorithm
determineTimes for generating the code of the black box
function for Grover’s search. Generated code is represented by
string templates, which might contain $v and ${...} expressions
to be replaced with their computed value. For large sets
of conflicting transactions, the assignment of start and end
times of these transactions can be further improved by using
some kind of decision tree (with O(log2(min(n, c)))) over the
conflicting transactions (for n >> c) or the transaction
numbers (for c >> n) for checking if the considered transaction
is a conflicting transaction instead of a sequential check.

Algo blackBox
Input: t: 7 qubits
Output: {True, False}
(μ_1, π) = determineSchedule(t)
μ_2 = 2
result = μ_1 < μ_2
t_0 = switch(π[0])
  case 0: 2
  case 1: 4
  case 2: 1
t_1 = switch(π[1])
  case 0: 2
  case 1: 4
  case 2: 1
t_2 = switch(π[2])
  case 0: 2
  case 1: 4
  case 2: 1
if(μ_1 = 0)
  times = [0, t_0]
  if(times[1]>R)
    result = False
  if(π[0]==0)
    t0s = times[0]
    t0e = times[1]
  if(π[0]==2)
    t2s = times[0]
    t2e = times[1]
  times = [0, t_1, t_1 + t_2]
  if(times[2]>R)
    result = False
  if(π[1]==0)
    t0s = times[0]
    t0e = times[1]
  if(π[1]==2)
    t2s = times[0]
    t2e = times[1]
  if(π[2]==0)
    t0s = times[1]
    t0e = times[2]
  if(π[2]==2)
    t2s = times[1]
    t2e = times[2]
if(μ_1 = 1)
  times = [0, t_0, t_0 + t_1]
  if(times[2]>R)
    result = False
  if(π[0]==0)
    t0s = times[0]
    t0e = times[1]
  if(π[0]==2)
    t2s = times[0]
    t2e = times[1]
  if(π[1]==0)
    t0s = times[1]
    t0e = times[2]
  if(π[1]==2)
    t2s = times[1]
    t2e = times[2]
  times = [0, t_2]
  if(times[1]>R)
    result = False
  if(π[2]==0)
    t0s = times[0]
    t0e = times[1]
  if(π[2]==2)
    t2s = times[0]
    t2e = times[1]
if(not(t0e < t2s or t2e < t0s))
  result = False
return result

Figure 10: Example of generated black box for m = 2, n = 3,
T = [2,4,1] and O = {(0,2)} in pseudo code. Some compiler
optimizations have already been applied like evaluating
constant expressions at compile time and dead code elimination.

We present the overall code generation algorithm in Figure 9
and an example of a generated black box for Grover’s search in
Figure 10. Please note that the code is sometimes not as elegant as
software developers of modern programming languages are used
to; this is due to the limited set of quantum computing operations
to be arranged in a circuit. The code generation algorithm is open
source and publicly available.² We generate code for the Silq [4]
quantum programming language and simulator, with
which we have extensively tested our generated code.


3.3 Estimating the number of solutions k 


We already discussed in subsection 2.2 that Grover’s search
achieves a speedup by a factor of $\sqrt{k}$ for the number k of
solutions. We have run experiments to determine k with 12
transactions with a mean length of 10 and a standard deviation of
3, varying the number of machines and the number of conflicts (see
Figure 11), to show the effects in terms of speedups. N is
8,589,934,592 for m = 2 and 2,199,023,255,552 for m = 4, but many
optimal solutions already exist: 48,384,000 (resulting in a speedup
of 6,955) for m = 2 and 559,872 (resulting in a speedup of 748) for
m = 4 (c = 0). Determining only suboptimal solutions achieves even
larger speedups: solutions within 25% of the optimal solution can
be determined with speedups of 38,374 for m = 2 and 45,247 for
m = 4. Hence we highly recommend running experiments for
typical transaction configurations of the application at hand and,
based on these experiments, estimating the number of solutions to
speed up Grover’s search.
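
The figures quoted in this subsection can be re-derived from the encoding size and the $\sqrt{k}$ rule. The following Python sketch is only such a cross-check: the function names are hypothetical, and the numbers of solutions k are taken from the text rather than recomputed.

```python
from math import factorial, isqrt


def search_space_size(n: int, m: int) -> int:
    """N = 2**bits with bits = ceil(log2(n!)) + (m - 1) * ceil(log2(n))."""
    bits = (factorial(n) - 1).bit_length() + (m - 1) * (n - 1).bit_length()
    return 2**bits


def speedup(k: int) -> int:
    """Integer part of sqrt(k), the speedup from knowing that k solutions exist."""
    return isqrt(k)


if __name__ == "__main__":
    # (number of machines, number of optimal solutions reported for c = 0)
    for m, k in [(2, 48_384_000), (4, 559_872)]:
        print(m, search_space_size(12, m), speedup(k))
```

For n = 12 the encoding needs 33 bits with two machines and 41 bits with four, which gives the two values of N stated above.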


3.4 Complexity Analysis 


In this section we compare the complexities according to prepro- 
cessing, runtime, space and code size of solving the optimizing 
transaction schedules problem on a traditional computer, universal 
quantum computer and quantum annealer. Table 1 summarizes the 
results of the complexities comparison. 


3.4.1 Complexity of Algorithm on Traditional Computer. The job
shop scheduling problem (JSSP), as a simpler variant of the
optimizing transaction schedules problem (without considering
blocking transactions), is already among the hardest combinatorial
optimization problems [10]. Hence the considered algorithm on the
traditional computer enumerates all possible transaction
schedules: for each transaction schedule it determines the
runtimes on each machine in O(n) to determine the minimum
execution time and checks in O(c) that conflicting transactions do
not have overlapping execution times. There are n! possibilities to
reorder the transactions (i.e., the number of permutations of the n
transactions), and the number of distributions of a permutation of
n transactions to m machines is the same as the number of
combinations of m objects taken n at a time with repetition, such
that $O\left(n! \cdot \binom{n+m-1}{n}\right)$ possibilities have to be
enumerated.


² https://github.com/luposdate/OptimizingTransactionSchedulesWithSilq

Figure 11: Cumulative number of solutions in relation to
runtime (n = 12, standard deviation of the lengths σ_l = 3).
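
The enumeration on a traditional computer described in subsection 3.4.1 can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names are hypothetical, every core is assumed to process its transactions back to back starting at time 0, and two transactions that merely touch in time are treated as non-overlapping.

```python
from itertools import combinations_with_replacement, permutations
from typing import Dict, List, Optional, Sequence, Tuple


def start_end_times(perm: Sequence[int], cuts: Sequence[int],
                    lengths: Sequence[int], m: int
                    ) -> Tuple[Dict[int, Tuple[int, int]], int]:
    """Start/end time of every transaction and the overall runtime (O(n))."""
    bounds = [0, *cuts, len(perm)]
    times: Dict[int, Tuple[int, int]] = {}
    makespan = 0
    for core in range(m):
        clock = 0
        for t in perm[bounds[core]:bounds[core + 1]]:
            times[t] = (clock, clock + lengths[t])
            clock += lengths[t]
        makespan = max(makespan, clock)
    return times, makespan


def brute_force(lengths: Sequence[int], conflicts: List[Tuple[int, int]],
                m: int) -> Optional[Tuple[int, Tuple[int, ...], Tuple[int, ...]]]:
    """Enumerate n! permutations times C(n+m-1, n) splits; keep the best valid one."""
    n = len(lengths)
    best = None
    for perm in permutations(range(n)):
        # m - 1 non-decreasing cut positions split perm into m subsequences
        for cuts in combinations_with_replacement(range(n + 1), m - 1):
            times, makespan = start_end_times(perm, cuts, lengths, m)
            valid = all(times[a][1] <= times[b][0] or times[b][1] <= times[a][0]
                        for a, b in conflicts)  # O(c) overlap check
            if valid and (best is None or makespan < best[0]):
                best = (makespan, perm, cuts)
    return best


if __name__ == "__main__":
    # lengths and conflict of the small instance used in Figure 10
    print(brute_force([2, 4, 1], [(0, 2)], m=2))
```

The inner loop over cut positions visits exactly the combinations with repetition counted in the text, so the number of candidates grows as n! times that binomial coefficient.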


3.4.2 Complexity of Quantum Computing Approach. Grover’s search
on universal quantum computers takes $O(\sqrt{N})$ iterations,
where N is the size of the function’s domain, and in each iteration
the test function needs time $O(n \cdot \log_2(n) + c)$. Please note
that determining the length of the i-th transaction in the
permutation needs $O(n \cdot \log_2(n))$ time using a decision tree
over the transaction numbers. Determining start and end times of
conflicting transactions needs
$O(n \cdot \log_2(\min(n, c)) + c)$ time by using a decision tree over
the conflicting transactions (for n >> c) or the transaction numbers
(for c >> n). For our algorithm,
$N = O(2^{\log_2(n!) + (m-1) \cdot \log_2(n)}) = O(n! \cdot n^m)$,
resulting in an overall runtime complexity of
$O(\sqrt{n! \cdot n^m} \cdot (n \cdot \log_2(n) + c))$. The overall
runtime complexity can be further improved by estimating the
number k of solutions (see subsection 3.3) to
$O\left(\sqrt{\frac{n! \cdot n^m}{k}} \cdot (n \cdot \log_2(n) + c)\right)$,
where suboptimal solutions with a higher threshold decrease the
runtime further. The space requirement directly depends on the
number of used qubits, which is in
$O(m \cdot \log_2(n) + \log_2(n!)) = O((n + m) \cdot \log_2(n))$ using
Stirling’s approximation
$n! \sim \sqrt{2 \pi n} \cdot \left(\frac{n}{e}\right)^n$, where ∼ means
that the two quantities are asymptotic, i.e., their ratio tends to 1
as n tends to infinity.
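
The bound on the number of qubits can also be checked without Stirling's approximation, using only the elementary estimate n! ≤ n^n. The short derivation below is an illustrative sketch added for this purpose; its last line uses the instance n = 12, m = 2 from subsection 3.3 as a numerical check.

```latex
\begin{align*}
\lceil \log_2(n!) \rceil + (m-1)\cdot\lceil \log_2(n) \rceil
  &\le \log_2(n^n) + 1 + (m-1)\cdot(\log_2(n) + 1) \\
  &= (n+m-1)\cdot\log_2(n) + m \\
  &= O((n+m)\cdot\log_2(n)) \quad \text{for } n \ge 2, \\[4pt]
n = 12,\ m = 2:\quad
\lceil \log_2(479001600) \rceil + \lceil \log_2(12) \rceil &= 29 + 4 = 33 .
\end{align*}
```

The 33 qubits of the check correspond to the domain size N = 2^33 = 8,589,934,592 reported for m = 2.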


3.4.3. Complexity of Quantum Annealing Approach. The complex- 
ities of the quantum annealing approach have already been de- 
termined in [5, 6], but because of simplicity of presentation, the 
upper bound n? for the number c of conflicts have been used in 
the complexities presented in [5, 6]. Because we directly use the 
number c of conflicts for Grover’s search algorithm in Table 1, we 
also determine preciser complexities for the quantum annealing 
approach resulting in O(m- R? - (c-m+n7*)) for preprocessing time 
 | Preprocessing | Execution Time | Space | Code Size
Traditional Computer | $O(1)$ | $O(n! \cdot (\;) \cdot (n + c))$ | $O(n + c + m)$ | $O(1)$
Quantum Computer | $O(n^2 \cdot c)$ | $O(\sqrt{\frac{n! \cdot n^m}{k}} \cdot (n \cdot \log_2(n) + c))$ | $O((n + m) \cdot \log_2(n))$ | $O(n^2 \cdot c)$
Quantum Annealer [5, 6] | $O(m \cdot R^2 \cdot (c \cdot m + n^2))$ | $O(1)$ | $O(m \cdot R^2 \cdot (c \cdot m + n^2))$ | $O(m \cdot R^2 \cdot (c \cdot m + n^2))$

m: number of machines; n: number of transactions; c: number of conflicts; R: maximum execution time on each machine;
k: estimated number of solutions

Table 1: Overview of preprocessing, runtime, space and code size complexities of the transaction schedule problem for traditional
and different types of quantum computers. The space complexity represents the growth of the number of required qubits for
quantum computing, number of variables for quantum annealing and number of bytes for traditional computers.


(i.e., time to generate the formula to optimize), space (i.e., number of 
binary variables) and code size (i.e., size of the formula to optimize). 
Once initialized, quantum annealers have a constant execution time 
O(1). 


4 SUMMARY AND CONCLUSIONS 


In this paper we propose to use Grover’s search for optimizing 
transaction schedules on universal quantum computers by generat- 
ing its black box function based on the given problem instance. We 
describe the encoding scheme for representing a candidate solution 
and the code generation. We compare the complexities according 
to preprocessing, runtime, space and code length with approaches 
for traditional computers and quantum annealers.

In future work, we will investigate improved encoding schemes
for optimizing transaction schedules using fewer bits, other
combinatorial optimization problems and also other applications of
Grover's search like quantum cryptography [17].
ABSTRACT


Big data systems are becoming mainstream for big data management,
for both batch processing and real-time processing. In order to
extract reliable insights from data, quality issues must be addressed.
A veracity assessment model is consequently needed. In this paper, we
propose a model which ties the quality of datasets to the quality of
query resultsets. In particular, we examine the quality issues raised
by a given dataset, order attributes by their fitness for use and
correlate veracity metrics to business queries. We validate our work
using the open NYC taxi trips dataset.
• Information systems → Query reformulation; Integrity checking;
Data cleaning.


KEYWORDS 
Veracity, Big data, 


ACM Reference Format: 

Rim Moussa and Soror Sahri. 2021. Customized Eager-Lazy Data Cleansing
for Satisfactory Big Data Veracity. In 25th International Database
Engineering Applications Symposium (IDEAS 2021), July 14-16, 2021,
Montreal, QC, Canada. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3472163.3472195


1 INTRODUCTION 


CrowdFlower surveyed 153 data scientists nationwide between
November 2014 and December 2014 [4]. 52.3% of them cited poor-quality
data as their biggest daily obstacle, and 66.7% said that cleaning and
organizing data is their most time-consuming task. The unprecedented
scale at which data is generated today has created a large demand for
scalable data collection, data cleansing, and data management. Big
data is characterized by the 5 V's: volume, velocity, variety,
veracity and value. Volume refers to the amount of data, which has
grown to the terabyte and petabyte scale; velocity refers to the speed
at which new data is generated and processed, so the challenge is to
analyze data while it is being generated; and variety refers to the
different types of data, e.g. structured (relational data),
semi-structured (XML, JSON) and unstructured (text, image, video).
While the first three V's are well defined, veracity and value are
not, and are consequently hard to measure. Moreover, if the data is
inaccurate, imprecise, incomplete, inconsistent or of uncertain
provenance, then any insights obtained from data analytics are to some
extent meaningless and unreliable. Data scientists conduct data
analysis with, e.g., incomplete data, and provide error bars to
reflect the uncertainty in the results of the analysis. Errors with a
large impact snowball, while errors with a small impact cause a
butterfly effect [15]. Hereafter, we present the motivations of our
research:


• Absence of standards and norms: there is still no consensus on
what the notion of veracity is, and consequently no standard
approach for measuring veracity.

• Disconnect between data source and data use: data users and
data providers are often different organizations with very
different goals and operational procedures. In many cases,
the data providers have no idea about the business use cases
of data users. This disconnect between data source and data
use is one of the prime reasons behind data quality issues.

• Big data and distributed architectures: processing big data
requires highly scalable data management systems. Performance
aspects should not be overlooked. In distributed data
management systems, quality issues are particularly difficult
to handle because (i) large datasets are distributed across
multiple nodes, (ii) query processing involves multiple nodes,
and (iii) new data batches and real-time data alter the database,
so that quality and veracity metrics become stale.

• Schema-on-read and the move from traditional data warehouse
systems to data lake and lakehouse architectures [17]: data
lakes and lakehouses are "catch-all" repositories, where
schema-on-read doesn't impose a specific model for all
datasets. Consequently, problems such as data quality,
provenance, and governance, which have historically been
associated with traditional data warehouses, are more complex
in a data lake or lakehouse system architecture [14].


In this paper, we propose a model for assessing the veracity
characteristic. The model ties the quality of datasets to the
veracity of query resultsets in a distributed data store. In our
framework, data quality is defined as the fitness for use of data. Our
model details how the veracity metrics on a given distributed dataset
are calculated, and fits the needs and requirements of users' queries.
In particular, it orders attributes by their fitness for use and
correlates veracity metrics to business queries. We validate our work
using an open dataset of NYC cab trips. The latter contains over a
billion individual taxi trips in the city of New York from January
2009 to present.

In the rest of the paper, we investigate veracity metrics calculus
in Section 2. Then, we validate our proposed veracity framework
using the NYC cabs dataset in Section 3. Section 4 overviews related
work. Finally, Section 5 recalls the contributions and outlines our
future work.


2 VERACITY FRAMEWORK 


In this section, we investigate quality and veracity metrics calculus
for a big dataset and a given workload. In our model, we propose
taking into account data inconsistency, data inaccuracy, and data
incompleteness without repairing the data and without dropping rows
from the original dataset. Indeed, the same data can be fit for one
business query, but not for another. We therefore consider different
cleansing strategies for different queries. We recall that the data
cleansing process is complex and consists of several stages, which
include specifying the quality rules, detecting data errors and
repairing the errors. Most data cleansing approaches undertake a data
repair action to remove errors from data sources. However, in
real-world applications, cleansing the data is costly, and may lead to
a loss of potentially useful data. Moreover, existing cleansing
approaches cannot ensure the accuracy of the repaired data and require
domain experts to validate repair rules. In closely related work,
consistent query answering is ensured without modifying sources [3].
These works focus on integrity constraints to express data quality
rules, such as functional dependencies or patterns (e.g. conditional
functional dependencies). Such rules can capture a wide variety of
errors including duplication, inconsistency, and missing values.
However, consistent query answering was only investigated for
relational data and not for schema-on-read and big data datasets.


2.1 Data Metrics Model 


We assume that data is in row format. Each row describes a record
with attributes. We overview how some quality metrics are computed
by considering the characteristics of distributed data, in particular
their fragmentation and replication characteristics. For this, let $R$
be a set of datasets $\{R_i\}$. Attributes $\{A_{ij}\}$ describe each
dataset $R_i$.


2.1.1 Data Consistency. It expresses the degree to which a set of
data satisfies a set of integrity constraints (e.g. check constraints,
reference constraints, entity constraints and functional dependencies).
Let,


• $C$ be the set of constraints $\{C_{ij}\}$, such that $\{C_{ij}\}$
relate to dataset $R_i$. Constraint types are functional
dependencies, check constraints, reference constraints and entity
constraints. A constraint is defined over a single attribute
(e.g. domain definition, unique, and so on) or over multiple
attributes (e.g. functional dependency, measure expression).


The data consistency metric counts, for each dataset $R_i$ and each
attribute $A_{ij}$, the number of rows which violate any of the defined
constraints $\{C_{ij}\}$.


• Data Fragmentation: If the dataset is fragmented horizontally,
the data consistency of dataset $R_i$ is first calculated for
each horizontal partition, then all counts are summed up.

If the dataset is fragmented vertically, the data consistency of
dataset $R_i$ is first calculated for each vertical partition and
for each attribute; the overall counts are then obtained with a
union all.

If the dataset is fragmented horizontally and vertically (i.e.
hybrid fragmentation), the data consistency of dataset $R_i$ is
first calculated through vertical partitions, then horizontal
partitions.

• Data Replication: The refresh protocol is very important.
Indeed, some protocols are strict and others are not. Most
analytical systems fall into the BASE category (i.e. Basically
Available, Soft state, Eventual consistency). Hence, in the case
of master-slave replication, we propose calculating a consistency
score for each replica. In the case of peer-to-peer replication,
following each refresh operation (update, insert or delete
propagation), we propose calculating a refresh propagation ratio
at each node, stated as number of not refreshed replicas / number
of replicas.
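The horizontal case and the refresh propagation ratio can be sketched as
follows. This is a minimal illustration under stated assumptions, not
the authors' implementation: rows are plain dictionaries, constraints
are Python predicates, replicas are identified by a version number, and
all names and sample values are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Constraint = Callable[[Row], bool]  # True when the row satisfies the rule


def count_violations(partition: List[Row], constraints: List[Constraint]) -> int:
    """Rows of one partition that violate at least one constraint."""
    return sum(1 for row in partition if not all(c(row) for c in constraints))


def consistency_horizontal(partitions: List[List[Row]],
                           constraints: List[Constraint]) -> int:
    """Per-partition counts, summed up over all horizontal partitions."""
    return sum(count_violations(p, constraints) for p in partitions)


def refresh_propagation_ratio(replica_versions: List[int], latest: int) -> float:
    """Number of not refreshed replicas / number of replicas."""
    stale = sum(1 for version in replica_versions if version < latest)
    return stale / len(replica_versions)


# Hypothetical sample data: two horizontal partitions and two check constraints.
partitions = [
    [{"fare": 12.5, "passengers": 1}, {"fare": -3.0, "passengers": 2}],
    [{"fare": 8.0, "passengers": 9}, {"fare": 20.0, "passengers": 4}],
]
constraints = [
    lambda r: r["fare"] > 0,
    lambda r: 1 <= r["passengers"] <= 5,
]

print(consistency_horizontal(partitions, constraints))
print(refresh_propagation_ratio([7, 7, 6, 7], latest=7))
```

Because each row lives in exactly one horizontal partition, summing the
per-partition counts does not count any row twice.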


2.1.2 Data Completeness. It relates to the presence and absence
of features, their attributes and relationships. The data
completeness metric counts, for each dataset $R_i$ and each attribute
$A_{ij}$, the number of null values (i.e. missing values).


• Data Fragmentation: If the dataset is fragmented horizontally,
the data completeness of dataset $R_i$ is first calculated for
each horizontal partition and for each attribute, then all counts
are summed up.

If the dataset is fragmented vertically, the data completeness of
dataset $R_i$ is first calculated for each vertical partition; the
overall counts are then obtained with a union all.

If the dataset is fragmented horizontally and vertically (i.e.
hybrid fragmentation), the data completeness of dataset $R_i$ is
first calculated through vertical partitions, then horizontal
partitions.

• Data Replication: The data completeness metric is to be
calculated for each replica.
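For hybrid fragmentation, the null counts can be combined as sketched
below. The sketch assumes that vertical fragments hold disjoint
attribute sets, so that the "union all" step reduces to merging
per-attribute dictionaries; the data layout and all names are
hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def null_counts(chunk: List[Row]) -> Counter:
    """Number of missing (None) values per attribute in one chunk."""
    counts: Counter = Counter()
    for row in chunk:
        for attribute, value in row.items():
            counts[attribute] += int(value is None)
    return counts


def completeness_hybrid(vertical_fragments: List[List[List[Row]]]) -> Dict[str, int]:
    """Each vertical fragment is a list of horizontal chunks of rows."""
    result: Dict[str, int] = {}
    for horizontal_chunks in vertical_fragments:
        per_fragment: Counter = Counter()
        for chunk in horizontal_chunks:
            per_fragment.update(null_counts(chunk))  # sum over horizontal chunks
        result.update(per_fragment)  # union all: attribute sets are disjoint
    return result


# Hypothetical layout: one fragment per attribute, two horizontal chunks each.
vertical_fragments = [
    [[{"fare": 10.0}, {"fare": None}], [{"fare": None}]],
    [[{"tolls": 0.0}, {"tolls": 5.5}], [{"tolls": None}]],
]

print(completeness_hybrid(vertical_fragments))
```

The same skeleton applies to the other count-based metrics by replacing
the per-value test with the relevant check.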


2.1.3 Data Accuracy. In [16], accuracy is defined as the extent to
which data are correct, reliable and certified. It introduces the idea
of how precise, valid and error-free data is. There are three main
accuracy definitions in the literature: (i) semantic correctness, which
describes how well data represent states of the real world [16]; (ii)
syntactic correctness, which expresses the degree to which data is free
of syntactic errors such as misspellings and format discordances
[11]; and (iii) precision, which concerns the level of detail of data
representation [12, 13]. The ISO 19113 standard on geospatial data
recommends the use of (1) positional accuracy: accuracy of the
position of features, (2) temporal accuracy: accuracy of the temporal
attributes and temporal relationships of features, and (3) thematic
accuracy: accuracy of quantitative attributes and the correctness of
non-quantitative attributes and of the classifications of features and
their relationships. The data accuracy metric counts, for each dataset
$R_i$ and each attribute $A_{ij}$, the number of inaccurate values.


• Data Fragmentation: If the dataset is fragmented horizontally,
the data accuracy of dataset $R_i$ is first calculated for each
horizontal partition and for each attribute, then all counts are
summed up.

If the dataset is fragmented vertically, the data accuracy of
dataset $R_i$ is first calculated for each vertical partition; the
overall counts are then obtained with a union all.

If the dataset is fragmented horizontally and vertically (i.e.
hybrid fragmentation), the data accuracy of dataset $R_i$ is first
calculated through vertical partitions, then horizontal
partitions.

• Data Replication: The data accuracy metric is to be calculated
for each replica.


2.1.4 Data Timeliness/Freshness. Data freshness introduces the idea
of how old the data is: Is it fresh enough with respect to the user
expectations? Does a given data source hold the most recent data?
Timeliness [16] describes how old the data is (since its
creation/update at the sources). It captures the gap between data
creation/update and data delivery.


2.1.5 Data Accessibility. Accessibility denotes the ease with which
the user can obtain the data analyzed (cost, timeframe, format,
confidentiality, respect of recognized standards, etc.).


2.2 Query Metrics Model 


Our query model computes veracity metrics of the query resultsets
on a given distributed dataset, based on the quality metrics model
detailed in §2.1. For simplicity and without loss of generality, we
assume that the quality metrics of the dataset are already defined
by considering the distributed data characteristics (fragmentation
and replication). We consequently consider, for the rest of the paper,
the distributed scenario where data is collected by a single entity,
but stored and processed in a distributed manner for scalable query
processing.

Let's consider the following notations: given a big dataset $D$, let
$\{Q_i\}$ for $i \in 1, \ldots, n$ be the set of queries on $D$, such
that,


• Each query $Q_i$ relates to a user profile, and each profile is
described by multiple queries.

• Each query $Q_i$ involves a set of attributes $\{A_j\}$, with
$j \in 1, \ldots, m$. For example, $Q_1$ involves the attribute set
$\{A_1, A_2\}$. Notice that a set of attributes may or may not
appear in different queries of different user profiles. That is,
each attribute can be relevant (or not) to each query and hence to
each user profile.

• For each attribute set involved in $Q_i$, there is a quality
constraint $C_j$. For example, the constraint $C_1$ corresponds to
a completeness constraint on attribute $A_1$, and the constraint
$C_2$ corresponds to a consistency constraint on $\{A_1, A_2\}$.

• Each quality constraint $C_j$ has a noise $N_j$ that corresponds
to the percentage of rows in the dataset $D$ which violate $C_j$.
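The noise of a constraint, as defined in the last item above, can be
computed as sketched below. The rows, attribute names and the two
constraints are hypothetical examples: the first is a completeness
constraint on one attribute, the second an invented consistency
constraint over two attributes.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]


def noise_percent(rows: List[Row], constraint: Callable[[Row], bool]) -> float:
    """Percentage of rows of the dataset that violate the constraint."""
    violating = sum(1 for row in rows if not constraint(row))
    return 100.0 * violating / len(rows)


# Hypothetical dataset D with two attributes.
D = [
    {"A1": 3, "A2": 4},
    {"A1": None, "A2": 1},
    {"A1": 5, "A2": 2},
    {"A1": 1, "A2": 1},
    {"A1": 9, "A2": 0},
]


def c1(row: Row) -> bool:
    """Completeness constraint on A1: the value must be present."""
    return row["A1"] is not None


def c2(row: Row) -> bool:
    """Illustrative consistency constraint on {A1, A2}: A1 <= A2 when A1 is set."""
    return row["A1"] is None or row["A1"] <= row["A2"]


print(noise_percent(D, c1), noise_percent(D, c2))
```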


We propose a measure named veracity score (denoted V-score) which
correlates the data quality metrics to the quality of queries'
resultsets. The veracity score is intended to capture, for each query
$Q_i$, how much the noise through a quality constraint $C_j$ affects
the veracity of its resultset and hence the fitness for use of the
attributes. The noise implied by each quality constraint can increase
or decrease the V-score according to a defined cleansing approach. Our
query model implements three cleansing approaches: eager, custom and
lazy. We recall that cleansing in our work is performed at query
answering time and not at the data sources. In the sequel, we describe
the three cleansing approaches:


(1) Eager virtual data cleansing: All rows in the dataset must
satisfy all defined quality constraints. Hence, none of the quality
constraints is violated in the cleansed dataset, and all the data
metrics are satisfactory. Indeed, if a consistency constraint
stating that $A_j > 0$ is violated, then all rows where $A_j \le 0$
are not considered in the query result, whether the query uses
the attribute $A_j$ or not. This approach may discard rows of
interest for queries, leading to a false query answer set. For
that reason, we assume that every row deleted will affect the
quality of the answer.

The noise is considered with positive impact for all quality
constraints satisfied while the query is not using their
involved attributes; and vice versa.

(2) Custom virtual data cleansing: Quality constraints are checked
if and only if the attributes within the constraints are used
by the query (i.e. $U(Q_i, A_j) = 1$). Hence, we discard all rows
which violate any constraint involving attribute $A_j$.

(3) Lazy virtual data cleansing: None of the constraints is checked.
Each query executes on the dataset as it is. This approach
aims at saving the execution of filters. Notice that the custom
cleansing approach might outperform this approach, since
it runs over a filtered dataset.
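The three approaches differ only in which constraints are active for a
given query, which the following sketch makes explicit. It is an
in-memory illustration under stated assumptions, not the authors'
implementation: the Constraint type, the cleanse function and the sample
rows are hypothetical, and a constraint counts as used when it shares at
least one attribute with the query.

```python
from typing import Callable, Dict, FrozenSet, List, NamedTuple, Set

Row = Dict[str, float]


class Constraint(NamedTuple):
    attributes: FrozenSet[str]      # attributes the rule refers to
    holds: Callable[[Row], bool]    # True when the row satisfies the rule


def cleanse(rows: List[Row], constraints: List[Constraint],
            query_attributes: Set[str], approach: str) -> List[Row]:
    """Rows visible to a query under eager, custom or lazy virtual cleansing."""
    if approach == "eager":
        active = list(constraints)  # every constraint is checked
    elif approach == "custom":
        # only constraints sharing an attribute with the query are checked
        active = [c for c in constraints if c.attributes & query_attributes]
    elif approach == "lazy":
        active = []                 # the dataset is queried as it is
    else:
        raise ValueError(f"unknown approach: {approach}")
    return [row for row in rows if all(c.holds(row) for c in active)]


constraints = [
    Constraint(frozenset({"fare"}), lambda r: r["fare"] > 0),
    Constraint(frozenset({"distance"}), lambda r: 0.5 < r["distance"] < 40),
]
rows = [
    {"fare": 10.0, "distance": 2.0},
    {"fare": -1.0, "distance": 3.0},
    {"fare": 7.0, "distance": 0.1},
]
query_attributes = {"fare"}  # the query only reads the fare

for approach in ("eager", "custom", "lazy"):
    print(approach, len(cleanse(rows, constraints, query_attributes, approach)))
```

In this example the eager approach also drops the row with a short
distance, although the query never reads that attribute.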


Summarizing, for a given query, custom cleansing corresponds
to the highest veracity score, while eager and lazy cleansing
correspond to the lowest veracity scores. The I/O cost affects query
performance. Indeed, vertical fragmentation, as implemented by
wide-column data stores, reduces the number of read operations and
transfers only the required data to main memory for both custom and
lazy data cleansing. Likewise, data filtering, enabled by the
constraint expressions of eager and custom cleansing, outputs a
smaller dataset; when performed prior to complex operations such as
joins and grouping, it also improves performance. We assume that
attributes and constraints are equally important in veracity
assessment. The model can be extended with custom weights assigned to
constraints.

We propose the formulas shown in Equation 1 and Equation 2 to
calculate the veracity of a query with respect to a quality constraint,
and for a selected approach.


$$V\text{-}Score_{ij}(Q_i, C_j) = W_{approach} \times N_{C_j} \qquad (1)$$

$$V\text{-}Score(Q_i) = \sum_{j=1}^{m} V\text{-}Score_{ij}(Q_i, C_j) \qquad (2)$$


As explained earlier, we consider three approaches; consequently,
approach is either eager, custom, or lazy. $N_{C_j}$ denotes the noise
that corresponds to quality constraint $C_j$. We propose the following
rule for assessing the veracity of each query with respect to each
cleansing approach: If the query $Q_i$ doesn't use attributes of the
constraint $C_j$, or uses them with no impact on the correctness of
$Q_i$'s answer set, then $W_{eager} = -1$ (the eager approach applies
constraints on data not retrieved for analysis, and this may alter the
veracity of the query answer set), $W_{custom} = 0$ (the custom
approach does not apply constraints on data not retrieved by the
query), and $W_{lazy} = 0$ (the lazy approach does not apply
constraints on data not retrieved by the query); else $W_{eager} = +1$
(the eager approach applies constraints on data retrieved by the
query), $W_{custom} = +1$ (the custom approach applies constraints on
data retrieved by the query), and $W_{lazy} = -1$ (the lazy approach
does not apply constraints on data retrieved for analysis, and this
may alter the veracity of the query answer set).
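The weighting rule and the two equations can be put together in a few
lines. In the sketch below the weights are transcribed from the rule
above, while the function names, the noise values and the usage flags of
the example query are hypothetical.

```python
from typing import Dict, List

# Weight per (approach, whether the query uses the constraint's attributes
# with an impact on its answer set), transcribed from the rule above.
WEIGHTS = {
    ("eager", False): -1, ("custom", False): 0, ("lazy", False): 0,
    ("eager", True): +1, ("custom", True): +1, ("lazy", True): -1,
}


def v_score(approach: str, noise: Dict[str, float], used: Dict[str, bool]) -> float:
    """Equation 2: sum over constraints of W_approach * N_Cj (Equation 1)."""
    return sum(WEIGHTS[(approach, used[c])] * noise[c] for c in noise)


def rank_approaches(noise: Dict[str, float], used: Dict[str, bool]) -> List[str]:
    """Approaches ordered from the highest to the lowest V-score."""
    return sorted(("eager", "custom", "lazy"),
                  key=lambda a: v_score(a, noise, used), reverse=True)


# Hypothetical query: it uses the attributes of C1 only.
noise = {"C1": 1.75, "C2": 0.83, "C3": 4.43}   # noise in percent, illustrative
used = {"C1": True, "C2": False, "C3": False}

for approach in ("eager", "custom", "lazy"):
    print(approach, round(v_score(approach, noise, used), 2))
print(rank_approaches(noise, used))
```

Custom weights per constraint, mentioned as a possible extension, would
amount to multiplying each term of the sum by an additional factor.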


3 USE CASE: NYC TAXI DATASET 


In this Section, we consider the NYC taxi dataset to validate our 
proposed model. We first present this dataset and related data qual- 
ity metrics. Then, we accordingly evaluate the quality of taxi trips 
query answer set using motivating examples. 

Information associated with cab trips can provide unprecedented
insight into many different aspects of city life, from economic
activity and human behavior to mobility patterns. But analyzing these
data presents many challenges. The data are complex, containing
geographical and temporal data in addition to multiple variables
associated with each trip [5, 10]. The New York City Taxi and
Limousine Commission has released a detailed historical dataset
covering over a billion individual taxi trips in the city from
January 2009. Each individual trip record contains precise location
coordinates for where the trip started and ended (spatial data),
timestamps for when the trip started and ended (temporal data),
plus a few other attributes including fare amount, payment method,
and distance traveled. We calculate other attributes such as
geohash-pickup, geohash-dropoff, trip-duration (min) and average-speed
(mph), as well as the day-time hierarchy: year → month → day of week →
hour.
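The derived temporal attributes can be obtained from the raw timestamps
and the trip distance, as in the following sketch. The function names
and the sample trip are hypothetical, the geohash attributes are left
out because they need a geospatial library, and the day of week is
returned as an ISO number (1 = Monday) by assumption.

```python
from datetime import datetime
from typing import Optional, Tuple


def trip_duration_min(pickup: datetime, dropoff: datetime) -> float:
    """Trip duration in minutes (non-positive for inconsistent timestamps)."""
    return (dropoff - pickup).total_seconds() / 60.0


def average_speed_mph(distance_miles: float, duration_min: float) -> Optional[float]:
    """Average speed in miles per hour, undefined for non-positive durations."""
    if duration_min <= 0:
        return None
    return distance_miles / (duration_min / 60.0)


def day_time_hierarchy(ts: datetime) -> Tuple[int, int, int, int]:
    """(year, month, ISO day of week, hour) of a timestamp."""
    return (ts.year, ts.month, ts.isoweekday(), ts.hour)


# Hypothetical trip record.
pickup = datetime(2015, 1, 2, 18, 0)
dropoff = datetime(2015, 1, 2, 18, 30)
duration = trip_duration_min(pickup, dropoff)

print(duration, average_speed_mph(6.0, duration), day_time_hierarchy(pickup))
```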


3.1 Quality Metrics of The NYC Taxi Data 


3.1.1. Data consistency. We propose the following constraints for 
cabs’ mobility data of New York City dataset, 


• Spatial constraints: The spatial data in the dataset relate to
pick-up and drop-off longitude and latitude. Several records
feature invalid longitude and latitude values (the longitude range
is -180° to 180°, and the latitude range is -90° to 90°), or values
that are valid but not within the envelope of New York City.
Moreover, map plots also show cabs in the sea [10]. For the 2015
yellow cabs data collection, over two and a half million trips
(1.75%) relate to invalid GPS coordinates (2,555,473 out of
146,112,989). GPS coordinates are checked as to whether they fall
in the New York City Minimum Bounding Box (MBB) or not (see first
constraint in Table 1).

• Temporal constraints: Each file groups the cab trips of a given
month. Consequently, date-time columns in the dataset should
conform to the file information. Trip durations must be
positive. For yellow cab trips in 2015, 1,215,064 trips out of
146,112,989 trips (0.83%) feature a duration less than or equal to
0 minutes (i.e. pickup-timestamp ≥ dropoff-timestamp).

• Domain constraints: The maximum number of passengers
is 5. For yellow cab trips in 2015, 5,124,540 trips out of
146,112,989 trips feature a number of passengers greater
than 5, i.e. 3.5%.

The records with a trip distance less than half a mile or greater
than 50 miles are not valid. For trips in 2015, 6,479,521 trips
out of 146,112,989 trips feature an erroneous trip distance,
i.e. 4.43%.

• Fare Amount: Some records have either a very high fare amount
or a negative one. These records give erroneous min, max and
average values for estimating the cost of a trip. The percentage
of negative or zero amounts is about 0.045% (65,301 trips).
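The shares reported for the constraints above follow directly from the
violation counts and the total number of 2015 yellow cab trips. The
short sketch below recomputes them; the counts are taken from the text,
while the labels and the helper name are only illustrative.

```python
TOTAL_TRIPS_2015 = 146_112_989  # yellow cab trips in 2015, as given in the text

violations = {
    "invalid GPS coordinates": 2_555_473,
    "non-positive duration": 1_215_064,
    "more than 5 passengers": 5_124_540,
    "erroneous trip distance": 6_479_521,
    "negative or zero fare amount": 65_301,
}


def share_percent(count: int, total: int = TOTAL_TRIPS_2015) -> float:
    """Share of violating trips, in percent of all trips."""
    return 100.0 * count / total


for label, count in violations.items():
    print(f"{label}: {share_percent(count):.3f}%")
```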


3.1.2 Data completeness. Yellow cab trips reported in 2015 are
complete, i.e. they feature all data. In our framework, we propose to
calculate a noise value for each attribute, as illustrated in Table 1.
Then, for each constraint, we propose to compute an incidence matrix,
to show whether or not the query uses the attribute set involved in
the constraint definition (see Table 2). In Table 3, we detail how to
measure the fitness for use of each approach, as well as the ranking
of the approaches.


3.2 User profiles and business queries 


User profiles for cabs’ mobility data analysis fall into the following 
groups, 


• Social scientists, who study cab passengers' preferences.
Examples of queries include: where do cabs go on Friday and
Saturday evenings?

• City planners, who focus on road capacities and toll fees.
Examples of queries include traffic jam detection.

• Cab dispatchers, who optimize cab dispatching to maximize
profit. Examples of queries include: identify where cabs should
be at a given time (day, hour, pickup location); or compare trips
with toll fees versus trips without toll fees with the same
from-to pattern and the same time interval.

• Customers, who want to know the min-max duration and cost
of trips from one location to another. Examples of queries
include the average, minimum and maximum cost, as well as the
average, minimum and maximum duration, of a trip from one location
to another on a given day and time.


For each query, the focus will be put on whether quality metrics
(the accuracy of spatial data, the completeness of data and the
consistency) impact the veracity of the query answer set or not.


3.3. Workload Examples 


Next, we detail three queries, namely Q1, Q2 and Q3. For each query,
we give its SQL statement and its results for the different cleansing
approaches, respectively lazy cleansing, custom cleansing and eager
cleansing. Google BigQuery shows the query processing time as
well as the volume of data used for data processing. Notice that
Google BigQuery is an immutable, column-oriented store, and the
more check constraints are enabled, the more data is retrieved for
calculating the query answer. Indeed, check constraints apply to
other attributes.

Query 1 calculates the top-K frequent trip patterns (geohash-pickup
- geohash-dropoff) (see Figures 1, 2). After discarding trips with
geohash-pickup and geohash-dropoff that are not in the valid
longitude and latitude range, the resultset appears without 214,331
misleading trips (when enabling the WHERE clause in Figure 1). Hence,
eager cleansing alters the query resultset, degrades query
performance, and presents a different result than lazy cleansing.
Notice that the elapsed time of the query with lazy cleansing is 5 s,
while that of the query with the WHERE clause is 6.7 s (eager
cleansing). Also, the volume of retrieved data is 4.4 GB for lazy and
custom cleansing versus 10.9 GB for eager cleansing. This confirms
that wide-column data stores (such as BigQuery) are optimal for lazy
and custom cleansing, while they are not for eager cleansing. The
latter performs all constraint checks and consequently retrieves more
attributes' data. We notice that the answer set differs across the
three approaches, including the answer set on the raw data (lazy
cleansing). Indeed, enabling all constraints on geography, cost, time
and trip attributes with incorrect values can affect the consistency
of the answer set. The custom cleansing approach is preferable for
this query.


Attribute set | Constraints | Noise (%)
{latitude, longitude} for pickup and dropoff | C1: 40.491603 < latitude < 45.01785 and -79.762152 < longitude < -71.856214 |
{pickup-timestamp, dropoff-timestamp} | C2: dropoff-timestamp > pickup-timestamp |
{trip-distance} | C3: 0.5 mi < trip-distance < 40 mi |
{nbr-of-passengers} | C4: 1 < nbr-of-passengers < 5 |
{fare-amount} | C5: fare-amount > 0 |
{tolls-amount} | C6: tolls-amount > 0 |

Table 1: Noise degree of each attribute.


Constraint
C1
C2
C3
C4
C5
C6
Table 2: Incidence Matrix (constraint-query).

Query 2 compares the cost and duration of trips for a given hour
and day from a pick-up geographic zone to a drop-off geographic
zone, with and without tolls (see Figure 3). The answer set of this
query is illustrated in Figure 4, and is the same for the three
approaches. However, the volume of retrieved data differs: we obtain
8.7 GB for the lazy approach, 9.8 GB for the custom approach, and
12 GB for the eager approach with all constraints enabled. The lazy
cleansing approach is therefore preferable for this query.

Query 3 calculates the measures AVG, MIN and MAX over the cost and
duration of trips for a given hour and day from a pick-up
geographic zone to a drop-off geographic zone (see Figure 5). We
notice that the resultset on the raw data can have many errors in
terms of accuracy and consistency (see Figure 6). Indeed, entries
(e.g. time and cost attributes) with incorrect values (negative or
zero values) can affect the computation of trip duration and trip
cost, and hence the consistency of the answer set. The data volume
retrieved is 7.6 GB for the lazy and custom approaches, and 10.9 GB
for the eager approach, which enables the verification of all
constraints. The custom cleansing approach is therefore preferable
for this query.


4 RELATED WORK 


In this section, we overview related work which tackles data quality
issues in general, as well as research papers which investigate data
quality issues in particular domains. In [7], Fletcher proposes a
framework that identifies data quality (DQ) problems in distributed
computing systems (DCS), and how the relationship between data
quality and DCS can impact decision making and organizational
performance. To present this relationship, Fletcher proposes crossing,
in a matrix, DCS attributes (e.g. data replication, partitioning,
etc.) with DQ dimensions (e.g. accuracy, timeliness, etc.). A cell in
the matrix identifies a possible relationship between the DQ dimension
and the DCS attribute. For example, the Accuracy-Data Replication
cell indicates that accuracy would be impacted by data replication.
The objective of [2] is to dynamically find the best trade-off between
the cost of the query and the quality of the result retrieved from
several distributed sources. For this purpose, Berti-Equille proposes
a query processing framework for dynamically selecting sources
with quality-extended queries and a negotiation strategy. The
underlying idea is to merge the quality of service approach with
quality of data techniques. From the quality of service domain,
the author extends QML (quality of service modeling language), a
query language for describing and manipulating data and source
quality contracts (e.g. accuracy, completeness, freshness and
consistency). The result is a quality-extended query language (XQual)
that allows quality requirements to be specified in a flexible way.
Saha and Srivastava [15] present challenges of data quality
management that arise due to the volume, velocity and variety of data
in the Big Data era. They present two major dimensions of big data
quality management, (i) discovering/learning based on data and (ii)
the accuracy vs efficiency trade-off under various computing models
(centralized and distributed computing models).

Batini et al. [1] compare thirteen methodologies for data quality 
assessment and improvement along several dimensions. The
comparison dimensions include (i) the methodological phases and steps
that compose the methodology, (ii) the strategies and techniques 
that are adopted in the methodology for assessing and improving 
data quality levels, (iii) the data quality dimensions and metrics 
that are chosen in the methodology to assess data quality levels, 
(iv) types of costs that are associated with data quality issues (in- 
direct costs associated with poor data quality and direct costs of 
assessment and improvement activities), (v) types of data, (vi) types 
of information systems, (vii) organizations involved in data man- 
agement, (viii) processes that create or update data with the goal 
of producing services required by users that are considered in the 
methodology and (ix) the services that are produced by the pro- 
cesses that are considered in the methodology. 

In the literature, the following projects, conducted on real datasets, 
tackle data quality issues: 


• PEDSnet: a clinical data research network that aggregates electronic 
  health record data from multiple children’s hospitals 
  to enable large-scale research; it is presented in [8, 9]. The authors 
  propose a set of systematic data quality checks to be 
  implemented and executed on the data. 

• ISMIR: Besson et al. examine quality issues raised by the development 
  of XML-based Digital Score Libraries and propose 
  a quality management model. 

• In [6], Firmani et al. choose specific types of big data 
  (notably deep Web data, sensor-generated data, and 
  tweets/short texts) and discuss how quality dimensions 
  can be defined in these cases. 


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 


R. Moussa and S. Sahri 


Table 3: Running example for calculating the fitness for use of each approach. Each triplet (x, y, z) corresponds respectively to the v-score of each approach: eager, custom, and lazy. 

[Table body lost in extraction. The surviving “Approaches’ ranking” column reads “custom, lazy, eager” in each of its three rows.] 


-- ST_GEOGPOINT expects (longitude, latitude) 
SELECT ST_GEOHASH(ST_GEOGPOINT(pickup_longitude, pickup_latitude), 6) AS pickup_geohash, 
       ST_GEOHASH(ST_GEOGPOINT(dropoff_longitude, dropoff_latitude), 6) AS dropoff_geohash, 
       COUNT(*) count_trips 
FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` 
WHERE (pickup_latitude BETWEEN -90 AND 90) 
  AND (dropoff_latitude BETWEEN -90 AND 90) 
  AND (pickup_longitude BETWEEN -180 AND 180) 
  AND (dropoff_longitude BETWEEN -180 AND 180) 
GROUP BY pickup_geohash, dropoff_geohash 
ORDER BY count_trips DESC 
LIMIT 10 


Figure 1: Q1: Top 10 frequent pickup-dropoff locations, with none of the constraints in Table 1 enabled. 


5 CONCLUSION 


In summary, this paper reviews related work on data quality 
assessment and motivates the use of different cleansing 
strategies, each appropriate to a given business query, on an immutable, 
wide-column data store (Google BigQuery) where data is queried 
by different user profiles with conflicting queries. We show that it is 
not appropriate to delete rows that violate quality constraints. 
For each query, one should compare the answer sets of the three 
cleansing approaches. Future work is devoted to data profiling and 
preparation for adequate querying by the query engine in big data 
lake and lakehouse architectures. 
Figure 2: Q1: Answer sets of the top 10 trips per pickup and drop-off geohash for, respectively, the lazy (upper-left), custom (upper-right), and eager (lower-middle) virtual cleansing approaches. 
WITH trips AS ( 
  SELECT *, 
    EXTRACT(HOUR FROM pickup_datetime AT TIME ZONE "America/New_York") hour, 
    EXTRACT(DAYOFWEEK FROM pickup_datetime AT TIME ZONE "America/New_York") day, 
    TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, MINUTE) duration_min, 
    ST_GEOHASH(ST_GEOGPOINT(pickup_longitude, pickup_latitude), 6) AS pickup_geohash, 
    ST_GEOHASH(ST_GEOGPOINT(dropoff_longitude, dropoff_latitude), 6) AS dropoff_geohash 
  FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` 
  WHERE (pickup_latitude BETWEEN -90 AND 90) 
    AND (dropoff_latitude BETWEEN -90 AND 90) 
    AND (pickup_longitude BETWEEN -180 AND 180) 
    AND (dropoff_longitude BETWEEN -180 AND 180) 
) 
SELECT AVG(alpha.total_amount) AS avg_cost_no_tolls, 
       AVG(beta.total_amount) AS avg_cost_with_tolls, 
       AVG(alpha.duration_min) AS avg_dur_no_tolls, 
       AVG(beta.duration_min) AS avg_dur_with_tolls 
FROM trips AS alpha, trips AS beta 
WHERE alpha.pickup_geohash = beta.pickup_geohash 
  AND alpha.dropoff_geohash = beta.dropoff_geohash 
  AND alpha.pickup_geohash = 'dr72wn' 
  AND alpha.dropoff_geohash = 'dr5rvp' 
  AND alpha.hour = beta.hour 
  AND alpha.day = beta.day 
  AND alpha.hour = 13  -- 1pm 
  AND alpha.day = 1    -- Sunday 
  AND alpha.tolls_amount = 0 
  AND beta.tolls_amount > 0 




Figure 3: Q2: Compare cost and duration for a given hour and day from ’dr72wn’ to ’dr5rvp’ with and without tolls, with none of the constraints in Table 1 enabled. 


Row  avg_cost_no_tolls  avg_cost_with_tolls  avg_dur_no_tolls  avg_dur_with_tolls 
1    32.8               45.41                25.0              27.0 


Figure 4: Q2: Compare cost and duration for a given hour and day from ’dr72wn’ to ’dr5rvp’ with and without tolls. The query answer set is the same for all approaches. 
WITH trips AS ( 
  SELECT *, 
    EXTRACT(HOUR FROM pickup_datetime AT TIME ZONE "America/New_York") hour, 
    EXTRACT(DAYOFWEEK FROM pickup_datetime AT TIME ZONE "America/New_York") day, 
    TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, MINUTE) duration_min, 
    ST_GEOHASH(ST_GEOGPOINT(pickup_longitude, pickup_latitude), 6) AS pickup_geohash, 
    ST_GEOHASH(ST_GEOGPOINT(dropoff_longitude, dropoff_latitude), 6) AS dropoff_geohash 
  FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` 
  WHERE (pickup_latitude BETWEEN -90 AND 90) 
    AND (dropoff_latitude BETWEEN -90 AND 90) 
    AND (pickup_longitude BETWEEN -180 AND 180) 
    AND (dropoff_longitude BETWEEN -180 AND 180) 
) 
SELECT MIN(total_amount) min_cost, MAX(total_amount) max_cost, 
       AVG(total_amount) avg_cost, MIN(duration_min) min_time, 
       MAX(duration_min) max_time, AVG(duration_min) avg_time, 
       COUNT(*) count_trips 
FROM trips 
WHERE day = 3   -- Tuesday 
  AND hour = 3  -- 3am 
  AND pickup_geohash = 'dr5ru7' 
  AND dropoff_geohash = 'dr5rue' 


Figure 5: Q3: AVG, MIN, MAX cost and duration for a given hour and day from ’dr5ru7’ to ’dr5rue’, with none of the constraints in Table 1 enabled. 


Row  min_cost  max_cost  avg_cost           min_time  max_time  avg_time           count_trips 
1    3.3       42.85     8.685347222222221  1         1438      8.422480620155042 

(a) Q3 answer set for lazy cleansing. 


Row  min_cost  max_cost  avg_cost           min_time  max_time  avg_time           count_trips 
1    3.3       42.85     8.685347222222218  1         1438      8.422480620155037 

(b) Q3 answer set for custom cleansing. 


Row  min_cost  max_cost  avg_cost           min_time  max_time  avg_time           count_trips 
1    43        42.85     8.739071440808633  1         1438      8.542744560561932 

(c) Q3 answer set for eager cleansing. 


Figure 6: Q3: AVG, MIN, MAX cost and duration for a given hour and day from ’dr5ru7’ to ’dr5rue’: required cleansing vs. eager cleansing, for respectively lazy, custom, and eager virtual cleansing. 
ABSTRACT 


In this paper, we propose a method for predicting the quality of 
crowdsourcing workers using the goodness-of-fit (GoF) of machine 
learning models. We assume a relationship between the quality of 
workers and the quality of machine learning models trained on the 
workers’ outcomes. Under this assumption, 
if worker quality is high, a machine learning classifier constructed 
from the worker’s outcomes can easily predict the outcomes of 
that worker. If this assumption is confirmed, we can measure 
worker quality without using correct answer sets, and 
requesters can save time and effort. Conversely, if a worker’s outcomes 
are of low quality, the outcomes do not correspond to the input tweets. 
Therefore, if we construct a tweet classifier from the input 
tweets and the worker’s labels, the classifier’s predictions 
and the worker’s outcomes should differ. 
We assume that GoF scores, such as the accuracy and F1 score on 
the test set of this classifier, correlate with worker quality, so that 
we can predict worker quality from the GoF scores. In our 
experiment, we ran a tweet classification task using crowdsourcing 
and confirmed that the GoF scores and the quality of workers 
correlate. These results show that we can predict the quality of 
workers using the GoF scores. 
1 INTRODUCTION 


Crowdsourcing is widely used in many situations, such as constructing 
training data for machine learning and evaluating information 
systems. Requesters give HITs (Human Intelligence Tasks) to 
workers, and the workers complete the HITs. Quality control is a 
main issue in crowdsourcing because not all workers process tasks 
honestly. 

The quality of crowdsourced outcomes is one of the most critical 
issues because there are many low-quality workers, called spammers, who 
consistently behave improperly [1]. Spammers aim to earn wages 
without effort. For example, some spammers do not read the tweets 
and randomly select a label for every tweet. Many researchers have 
proposed methods to prevent spammers from reducing the quality of 
outcomes. 

Majority voting (MV) is a simple and effective algorithm 
against spammers. The requester assigns more than two workers 
to each tweet and then selects the label with the most votes. 
Therefore, if the labels selected by spam workers do not win a 
majority, these labels are not selected. In this way, the 
quality of outcomes is maintained. 
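
The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `majority_label` and the choice to return `None` on a tie are assumptions made for this example.

```python
from collections import Counter


def majority_label(votes):
    """Return the most voted label, or None if there is no unique winner.

    `votes` is a list of labels submitted by the workers assigned to one
    tweet. Returning None on a tie is an assumption of this sketch.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most voted labels
    return ranked[0][0]


if __name__ == "__main__":
    votes = ["related", "related", "unrelated", "related", "unrelated"]
    print(majority_label(votes))
```

A spammer's random label only changes the result when spam votes outnumber the honest ones for a tweet, which is why the scheme degrades as the share of spammers grows.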

However, the quality of the outcomes is low if a large share of 
the workers are spammers. Several algorithms based 
on MV and EM algorithms have been proposed to improve the quality of outcomes, such 
as Dawid & Skene [2], GLAD [3], and a system by Raykar et al. [4]. 
If the number of workers assigned to each tweet is sufficient and 
the average quality of workers is relatively high, the consensus of 
the voting is correct. However, we cannot calculate the aggregation 
result in real time because the outcomes of the other workers are 
required to calculate majority votes. Moreover, the monetary cost 
increases if we assign many workers to each tweet. 

One feature of our approach is that we can identify spammers 
without using the outcomes of the other workers. Here we define 
worker quality as the ratio of the number of correct 
votes to the number of all votes, where majority voting 
decides which votes are correct. Calculating worker quality 
with existing methods is time-consuming because the 
worker quality values can be calculated only after all HITs are processed; 
they cannot be calculated while HITs are being processed because 
the outcomes of all HITs are required. Therefore, we need 
an algorithm for predicting these worker quality values without 
using the outcomes of all workers. 

Our approach is to identify spammers without using the outcomes 
of the other workers. We use the goodness-of-fit (GoF) 
of machine learning models to predict the quality of workers. 
We use the accuracy and F1-score on test data, which indicate 
generalization capability, as goodness-of-fit measures. If workers 
label tweets appropriately, their outcomes can be accurately 
predicted by machine learning models that have sufficient 
expressive power. In contrast, the labels of spam workers who select 
labels randomly do not correlate with the tweets. Therefore, if we 
use these tweets and labels as a training data set to construct 
a machine learning-based classifier, its accuracy on a test data set 
should be low. On the other hand, because the input tweets and output labels 
of high-quality workers correlate with each other, classifiers 
trained on these input tweets and output labels 
should achieve high accuracy. 

In this paper, we consider multi-label text classification crowdsourcing 
tasks for tweets. For example, if a requester wants to collect tweets 
related to COVID-19 using machine learning, the requester needs training 
data. Therefore, the requester constructs a task in which workers decide 
whether tweets are related to COVID-19 or not. When a worker 
receives a tweet from a crowdsourcing platform, the worker 
submits one label, such as “related” or “unrelated.” 

We conducted experiments to confirm a correlation between the 
quality of workers and the goodness-of-fit of the machine 
learning model. In our experiment, we ran a tweet classification task 
related to COVID-19 using crowdsourcing. We hired 151 workers, 
who voted on whether the tweets are related to COVID-19 or not. Using 
the tweets and the votes, we generated a classifier using BERT 
(Bidirectional Encoder Representations from Transformers) [5], a 
state-of-the-art machine learning model that can be used as a 
classifier. We performed 10-fold cross-validation for each worker and 
calculated the accuracy and F1-score as goodness-of-fit (GoF) scores. We 
also calculated an acceptance rate using majority voting and 
the correlation between the GoF scores and the acceptance rate. 

The contributions of this paper are as follows: 


• We confirm that the goodness-of-fit values of the machine 
  learning model and the quality of the workers correlate with 
  each other. 

• Using the goodness-of-fit scores, systems can measure 
  the quality of the workers. 

• More than 200 HITs per worker are required to calculate 
  worker quality accurately. 


2 RELATED WORK 


As described in Section 1, majority voting-based aggregation 
methods have been proposed to improve the quality of outcomes. However, 
these methods require assigning more than two workers to each HIT. 
Okubo et al. [6] found that more than four workers are required 
to make binary decisions accurately when more than half of 
the workers always answer correctly. With these methods, 
requesters must pay wages to at least four workers for each 
HIT, so the cost of the task increases in order to maintain the quality of outputs. 
Many researchers have proposed methods for measuring worker quality 
in order to reduce wages. 

One possible solution is to capture workers’ behaviors to predict 
the quality of workers [7, 8]. Moayedikia et al. [9] proposed 
a method using the EM algorithm. Rzeszotarski et al. [10] proposed 
a method for predicting worker quality from workers’ behaviors, 
such as task time and mouse movement, which can be observed 
using JavaScript. We [11] proposed capturing more behaviors by 
requiring additional operations from workers. For example, when workers 
classify tweets, they must browse the tweets. We added a 
button to the work screen so that workers can browse a tweet only 
while they are pressing the button. 

The purpose of our method is to measure a worker’s quality without 
using the other workers’ outcomes. This purpose 
is similar to that of Rzeszotarski et al. However, these 
existing methods consider only workers’ behaviors and do not 
consider the workers’ outcomes. We believe that the outcomes of 
the workers should be treated as a behavior of the workers. To the 
best of our knowledge, this is the first attempt to predict worker quality using 
the outcomes of each worker. In our method, we try to capture 
features from the outcomes using machine learning models and 
confirm whether the features correlate with worker quality. By combining 
these existing methods with our proposed method, we expect to measure 
worker quality more accurately. 


3 METHOD 


In this section, we describe how to calculate GoF scores for each 
worker. We measure worker quality in the following three 
steps: 


(1) Workers process HITs 

(2) Construct a machine learning model using the worker’s outputs 

(3) Measure the goodness-of-fit by 10-fold cross-validation 


We use a majority voting-based worker evaluation system, 
which we call a point system, to increase the quality of outcomes. 
The system assigns each tweet to multiple 
workers and aggregates their outcomes using majority 
voting. If the label a worker selects is the majority label, the worker earns 
at least 1 point; if it is not, the worker 
earns only 0.01 points. We explain this algorithm in Section 3.2. 


3.1 Process tasks by workers 


We prepare a set of tweets $T = \{t_1, t_2, \dots, t_n\}$ using keywords. Our 
task aims to collect tweets related to COVID-19, so we use the keywords 
“COVID” and “Corona” to filter tweets. 

We hired a set of workers $W = \{w_1, w_2, \dots, w_m\}$ using a crowdsourcing 
platform. We did not screen workers, and we did not measure 
their quality with pre-tasks before they processed our 
task. All workers who applied for this task took part in it. 

We prepared the following labels $L = \{l_0, l_1, l_2, \dots, l_6\}$: 


• $l_0$: The tweet is related to COVID-19. 
  - $l_1$: The tweet includes only a fact that is publicly available. 
  - $l_2$: The tweet includes only a fact that is not publicly 
    available. 
  - $l_3$: The tweet includes opinions of the Twitter user who 
    posted this tweet. 
  - $l_4$: I (the worker) cannot tell whether the tweet 
    includes facts or opinions. 
• $l_5$: The tweet is not related to COVID-19. 
• $l_6$: I (the worker) cannot select from the labels $l_0, l_1, \dots, l_5$. 
Figure 1: Overview of our system. The majority vote part is required for evaluating our proposed system but is not required in production. 


[Work screen. Counters at the top left: Done 48, Evaluated 49, Earned Pt. 73.4, Pending Pt. 73.4. Buttons: Explanation, End. Sample tweet by user “Caffeine” (6:26, 1/24/2020): “I want to buy a stock of companies related to COVID-19, but I can’t find which companies are related to.” Label buttons: Related, Unrelated, Difficult to Judge.] 


Figure 2: Screen capture of labeling tweets. The original work 
screen is written in Japanese. All descriptions are translated 
for convenience. 


When workers select $l_0$, they then select one label from $l_1$, 
$l_2$, $l_3$, and $l_4$. Therefore, $l_0$ itself is not used for labeling tweets. 

For example, for a tweet such as “New York city is locked 
down.,” the workers should select $l_1$. However, for the tweet 
“New York city is locked down. I wanna go shopping but...” the 
workers should select $l_3$, not $l_1$, because this tweet includes both 
a fact and an unclear opinion. The workers should select $l_2$ for the 
tweet “My father gets coronavirus twice.” We explained this policy 
to the workers with many examples before they labeled tweets. 
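
The two-step selection can be sketched as a small function that maps a worker's choices to the label that is finally stored. This is only an illustration under stated assumptions: the label strings "l0" to "l6" and the function name `final_label` are hypothetical and do not come from the authors' system.

```python
# Labels that can only be chosen after selecting "l0" (related to COVID-19).
SUB_LABELS = {"l1", "l2", "l3", "l4"}
# Labels that can be chosen directly in the first step.
DIRECT_LABELS = {"l5", "l6"}


def final_label(first_choice, second_choice=None):
    """Map a two-step selection to the label recorded for the tweet.

    "l0" is never recorded itself: it only opens the second step.
    """
    if first_choice == "l0":
        if second_choice not in SUB_LABELS:
            raise ValueError("l0 requires a second choice among l1..l4")
        return second_choice
    if first_choice in DIRECT_LABELS:
        return first_choice
    raise ValueError(f"unknown first choice: {first_choice!r}")


if __name__ == "__main__":
    # A fact plus an opinion: related, then "includes opinions".
    print(final_label("l0", "l3"))
    # A tweet about Corona beer: not related.
    print(final_label("l5"))
```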

We showed instructions for this task to all workers. The instructions 
explain the labels with some examples and introduce the point 
system described in Section 3.2. We did not explain how the points 
are calculated. 

Fig. 2 shows a screen capture of the labeling application for 
workers. The tweet to be labeled is at the center of the screen. 
The worker reads this tweet and first selects one 
of three labels: $l_0$, $l_5$, or $l_6$. If the worker selects $l_0$, the worker 
then selects one of four labels: $l_1$, $l_2$, $l_3$, or $l_4$. The set of outcomes $V(w_i)$ 
of worker $w_i$ is as follows: 


$$V(w_i) = \{(t_a, l(t_a, w_i)), (t_b, l(t_b, w_i)), \dots, (t_{k(w_i)}, l(t_{k(w_i)}, w_i))\} \quad (1)$$

where $t_a, t_b, \dots, t_{k(w_i)} \in T$ are the tweets that $w_i$ processes, $k(w_i)$ 
is the number of tweets that $w_i$ processes, and $l(t_a, w_i) \in L$ is the label 
given to $t_a$ by $w_i$. Each time $w_i$ processes a tweet, the “Done” counter at the 
top left of Fig. 2 is incremented. 

To increase the quality of outcomes, the system should not pay wages 
to spammers. However, spammers can process many tweets in a short 
time and earn substantial wages. To 
avoid this problem, we construct a point system that continuously 
measures the quality of workers. In the next section, we explain 
this system in detail. 


3.2 Earn Points 


To discourage spamming, we use a majority voting-based point 
system. In this system, workers earn points if the labels 
they select are the majority labels. To earn wages, workers must earn 
points. Therefore, spammers accumulate points 
slowly and do not earn enough wages. 

Our system assigns five workers to each tweet and thus 
receives five labels for each tweet. If the number of votes for the 
majority label is less than half the number of all labels, the system 
assigns one more worker and receives another label. If the number of 
votes for the majority label is more than half the number of all labels, 
or if the number of workers is more than ten, the system stops 
assigning workers. 
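
The stopping rule can be sketched as a predicate over the labels received so far. This is a minimal sketch under assumptions: the text does not say what happens when the top label has exactly half of the votes, so the sketch keeps assigning workers in that case, and it stops once ten labels have been received, in line with the "between five and ten workers" setting reported in Section 4.1. The function name `needs_more_workers` is hypothetical.

```python
from collections import Counter


def needs_more_workers(labels, min_workers=5, max_workers=10):
    """Decide whether one more worker should be assigned to a tweet.

    `labels` holds the labels received so far for that tweet.
    """
    n = len(labels)
    if n < min_workers:
        return True  # the first five workers are always assigned
    if n >= max_workers:
        return False  # cap on the number of workers per tweet
    top_count = Counter(labels).most_common(1)[0][1]
    # Continue while the top label does not hold a strict majority.
    return top_count * 2 <= n


if __name__ == "__main__":
    print(needs_more_workers(["l1", "l1", "l1", "l3", "l5"]))
    print(needs_more_workers(["l1", "l1", "l3", "l3", "l5"]))
```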

Then, the system gives points to the workers: 


• If a worker did not select the majority label, the 
  worker gets 0.01 points. 

• If a worker selected the majority label, the worker 
  gets $p$ points as follows: 


$$p = \frac{\sum_i |l_i|}{|l_m|} \quad (2)$$

where $l_m$ is the majority label for the tweet, $|l_m|$ is the number 
of workers who selected $l_m$, and $\sum_i |l_i|$ is the number of workers 
assigned to the tweet. For example, if the first five workers select the 
same label, $|l_m|$ and $\sum_i |l_i|$ are the same, and the 
workers earn 1 point. 
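
The point rule can be sketched as follows. This is a sketch under a stated assumption: Eq. (2) is read as the number of workers assigned to the tweet divided by the number of workers who chose the majority label, which yields exactly 1 point when all workers agree and more than 1 point otherwise. The function name `points_for_vote` and the tie handling (the first label reaching the top count wins) are hypothetical choices for this example.

```python
from collections import Counter

MINORITY_POINTS = 0.01  # points for a vote that is not the majority label


def points_for_vote(all_labels, worker_label):
    """Return the points one worker earns for one evaluated tweet.

    `all_labels` are the labels of every worker assigned to the tweet,
    and `worker_label` is the label chosen by the worker being scored.
    """
    counts = Counter(all_labels)
    majority, majority_count = counts.most_common(1)[0]
    if worker_label != majority:
        return MINORITY_POINTS
    # Assumed reading of Eq. (2): assigned workers / majority voters.
    return len(all_labels) / majority_count


if __name__ == "__main__":
    labels = ["l1", "l1", "l1", "l3", "l3"]
    print(points_for_vote(labels, "l1"))
    print(points_for_vote(labels, "l3"))
```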

The points are calculated when the system stops assigning workers 
to the tweet. That is, the points are not calculated immediately after 
the worker labels the tweet. The numbers appear at the top left 
of the work screen shown in Fig. 2. “Done” is the number 
of HITs that the worker has processed, and “Evaluated” is the number 
of tweets for which the points have been calculated. 
“Earned Pt.” is the total number of points the worker has earned, and 
“Pending Pt.” is the number of points that have not yet been converted to 
wages. If the earned points exceed a threshold 
that the requester decides, the worker earns wages. 

Our system does not tell workers whether the labels they 
select are the majority or not. 


3.3 Measure GoF scores 


GoF scores are the average accuracy or average F1 score 
on the test sets; they indicate how well the machine learning-based 
classifier fits the outcomes of the worker. We calculate the GoF 
scores for each worker using stratified 10-fold cross-validation because 
the number of tweets per label may be imbalanced. 
We construct a machine learning-based classifier that imitates 
the votes of the worker. We used BERT (Bidirectional Encoder 
Representations from Transformers) [5], a state-of-the-art text 
classifier, to build the classifier. 

We randomly divide $V(w_i)$ into 10 sets $V_1(w_i), V_2(w_i), \dots, V_{10}(w_i)$. 
First, we set $V_1(w_i)$ as the test set and the other sets as the training set. 
We construct a machine learning model and measure the accuracy 
$a_1$ and F1-score $f_1$ on the test set. Next, we set $V_2(w_i)$ as the test set 
and repeat the same process. We do this 10 times and calculate 
$a_1, a_2, \dots, a_{10}$ and $f_1, f_2, \dots, f_{10}$. 

Finally, we calculate the average accuracy $A(w_i)$ and 
the average F1-score $F(w_i)$. These values show the goodness-of-fit of 
the machine learning model to worker $w_i$’s outcomes. We use 
$A(w_i)$ or $F(w_i)$ as the quality value of a worker. 
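
The procedure above can be sketched in plain Python. This is an illustration under loud assumptions, not the authors' pipeline: BERT is replaced by a toy token-voting classifier (`train_token_vote`) so that the example runs without external libraries, the F1 score is macro-averaged (the paper does not state the averaging), and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(samples, k=10, seed=0):
    """Split (text, label) pairs into k folds with similar label ratios."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        for j, sample in enumerate(group):
            folds[(offset + j) % k].append(sample)
        offset += len(group)
    return folds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (an assumption of this sketch)."""
    scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def train_token_vote(train):
    """Toy stand-in for BERT: tokens vote for labels seen in training."""
    token_counts = defaultdict(Counter)
    prior = Counter()
    for text, label in train:
        prior[label] += 1
        for token in text.split():
            token_counts[token][label] += 1
    default = prior.most_common(1)[0][0]

    def predict(text):
        votes = Counter()
        for token in text.split():
            votes.update(token_counts.get(token, {}))
        return votes.most_common(1)[0][0] if votes else default

    return predict


def gof_scores(samples, train_fn=train_token_vote, k=10):
    """Return (A, F): mean accuracy and mean F1 over k stratified folds."""
    folds = stratified_folds(samples, k)
    accs, f1s = [], []
    for i, test in enumerate(folds):
        if not test:
            continue
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predict = train_fn(train)
        y_true = [label for _, label in test]
        y_pred = [predict(text) for text, _ in test]
        accs.append(accuracy(y_true, y_pred))
        f1s.append(macro_f1(y_true, y_pred))
    return sum(accs) / len(accs), sum(f1s) / len(f1s)


if __name__ == "__main__":
    # One worker's outcomes V(w_i) as (tweet text, label) pairs.
    consistent = [(f"covid news {i}", "l1") for i in range(20)]
    consistent += [(f"beer party {i}", "l5") for i in range(20)]
    print(gof_scores(consistent))
```

Passing a different `train_fn` is how a real text classifier would be plugged in; the cross-validation and scoring code stays the same.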


4 EXPERIMENTAL EVALUATION 


We conducted the experiments to answer two questions: 


(1) Is it possible to measure the quality of workers using the 
goodness-of-fit values of the machine learning model, 
such as $A(w_i)$ and $F(w_i)$? 

(2) How many HITs should a worker process before the 
quality of the worker can be predicted accurately? 


We performed two experiments. Experiment 1 confirms whether 
$A(w_i)$ or $F(w_i)$ correlates with worker quality. Experiment 2 
determines how many HITs a worker should process for calculating 
accurate GoF scores. 


4.1 Experimental Setup 


We collected 51,218 tweets related to COVID-19 using the Twitter Sampled 
stream v1 API¹ with the keywords “COVID” and “Corona.” All of 
the tweets were posted between January and May 2020 and are written in Japanese. 
These tweets do not include retweets or replies. Tweets that 
contain a URL were also removed. We did not select tweets manually, so 
some tweets are not related to COVID-19 but include the string 
“Corona” or “COVID,” such as tweets related to Corona beer or 
tweets related to a video streaming service named niconicovideo, 
because the string “niconicovideo” includes the string “COVID.” 


¹ https://developer.twitter.com/en/docs/labs/sampled-stream/ 


Table 1: The number of tweets by majority voting and the 
number of votes 


label | # tweets | # votes 
$l_1$ |   12,148 |  72,917 
$l_2$ |    4,829 |  56,605 
$l_3$ |   31,819 | 162,635 
$l_4$ |       62 |   4,047 
$l_5$ |    4,754 |  55,872 
$l_6$ |       27 |   3,019 
all   |   50,696 | 355,095 

We constructed a tweet labeling system. We used Ruby on Rails 
6.0.3² and PostgreSQL 13³ for the backend and React⁴ for the 
frontend. The frontend is suitable for smartphones; 
more than half of the workers used smartphones to process the task. 

We hired 195 workers from CrowdWorks⁵, a major crowdsourcing 
platform in Japan. The system assigns between five and ten 
workers to each tweet. In our system, workers earn 200 JPY (≈ 2 USD) 
for every 500 points (≈ 500 HITs). 

There is no gold data; that is, no answers are known a priori. 
Therefore, we generate gold data using majority voting. 
Table 1 shows the number of tweets per majority label. There 
are 2,943 tweets with two or more majority labels (ties); we count 
such a tweet under each of its majority labels. Labels $l_4$ and $l_6$ 
mean that the workers could not select a label. Because the workers 
could decide labels for almost all tweets, we ignore tweets 
labeled $l_4$ and $l_6$ in our experiments. We constructed four-label ($l_1$, 
$l_2$, $l_3$, and $l_5$) classifiers for each worker to calculate the GoF scores. 
Then, as gold data for worker quality, we measure the acceptance rate $R(w_i)$ 
of each worker as follows: 


$$R(w_i) = \frac{|V(w_i) \cap MV|}{|V(w_i)|} \quad (3)$$

where $|V(w_i)|$ is the number of votes by $w_i$, $MV$ is the set of correct labels 
determined by majority voting, and $|V(w_i) \cap MV|$ is the number of votes for which 
the label by majority voting and the label by $w_i$ are the same. 
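
The acceptance rate can be computed directly from a worker's votes and the majority labels. The sketch below is illustrative only: the dictionary layout (tweet id mapped to label) and the function name `acceptance_rate` are assumptions, and a tweet with no majority label on record simply counts as a non-match.

```python
def acceptance_rate(worker_votes, majority_labels):
    """Share of a worker's votes that agree with the majority label.

    worker_votes:    dict mapping tweet id -> label chosen by the worker
    majority_labels: dict mapping tweet id -> label decided by majority vote
    """
    if not worker_votes:
        raise ValueError("the worker has no votes")
    matches = sum(
        1
        for tweet_id, label in worker_votes.items()
        if majority_labels.get(tweet_id) == label
    )
    return matches / len(worker_votes)


if __name__ == "__main__":
    votes = {"t1": "l1", "t2": "l3", "t3": "l5", "t4": "l1"}
    majority = {"t1": "l1", "t2": "l3", "t3": "l1", "t4": "l2"}
    print(acceptance_rate(votes, majority))
```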
We constructed a machine learning model using BERT. We used a 
pretrained Japanese BERT model and tokenizer⁶. We set the number 
of epochs to 200 and the number of early-stopping rounds to 20. 


4.2 Experiment 1: Predicting worker quality using GoF scores 


² https://rubyonrails.org 
³ https://www.postgresql.org 
⁴ https://github.com/facebook/react 
⁵ https://crowdworks.jp/ 
⁶ https://github.com/cl-tohoku/bert-japanese 


(b) F1-score 


Figure 3: Average accuracy score and average F1-score vs. acceptance rate. Each solid line shows a regression line, and each 
dotted line shows the average accuracy score or average F1-score of the classifier trained and tested on the labels by majority vote. 


Fig. 3 shows the results of this experiment. Each point corresponds 
to one worker. The Pearson correlation coefficient is 0.23 for accuracy 
and 0.444 for the F1-score. The p-values are 4.80 · 10^-x for accuracy 
and 1.16 · 10^-y for the F1-score (the exponents are illegible in the source); 
both values are lower than 0.05. 
From this result, we conclude that both the accuracy score and the F1 score 
correlate with worker quality. Therefore, using either the accuracy 
score or the F1 score, we can predict worker quality accurately. Also, 
the F1 score is better than the accuracy score for predicting worker 
quality values. 
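
For reference, the Pearson correlation coefficient used in this experiment can be computed as in the following sketch. The function name `pearson_r` is chosen for this example, and the four data points are the $R(w_i)$ and $F(w_i)$ values of the workers listed in Table 2, used only to show the call; they are not the full set of workers behind the reported coefficients.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Acceptance rates and F1-scores of the four workers in Table 2.
    acceptance = [0.787, 0.253, 0.500, 0.804]
    f1_scores = [0.922, 0.227, 0.778, 0.785]
    print(pearson_r(acceptance, f1_scores))
```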

We examined the results in detail and found that both the 
accuracy scores and the F1 scores are high if the number of HITs 
the worker processes is small, even if the worker is low-quality. 
We confirmed that we could identify low-quality workers if the 
workers process more than 200 tasks and their outcomes 
differ from those of the good workers. In this experiment, the 
acceptance rate of 116 of the 151 workers (77%) is more than 0.6; these 
workers can be considered good workers. 

We drew a regression line for each graph. In the graph of the 
accuracy score, the slope and the intercept of the regression line 
are 0.20 and 0.56, respectively, and the RMSE is 0.12. In the graph of the 
F1-score, the slope and the intercept are 0.29 and 0.62, respectively, and the RMSE 
is 0.14. This result also confirms that the F1-score is 
better than the accuracy score: because the slope for the F1-score 
is steeper than that for the accuracy score, the F1 score separates 
high-quality and low-quality workers more clearly. 
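
A regression line and its RMSE can be obtained with ordinary least squares, as sketched below. The sketch assumes that the acceptance rate is on the horizontal axis and the GoF score on the vertical axis, as the caption of Fig. 3 suggests; the function names `fit_line` and `rmse` are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all x values are identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Acceptance rates (x) and F1-scores (y) of the workers in Table 2.
    xs = [0.787, 0.253, 0.500, 0.804]
    ys = [0.922, 0.227, 0.778, 0.785]
    slope, intercept = fit_line(xs, ys)
    print(slope, intercept, rmse(xs, ys, slope, intercept))
```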

In this experiment, we also constructed a classifier that outputs 
the labels decided by majority voting. We prepared the input texts and the label 
assigned to each text by majority voting. Then, we calculated the 
accuracy score and the F1 score by 10-fold cross-validation. We 
show the accuracy score and F1-score as dotted lines in each graph 
of Fig. 3. The accuracy score is 0.668, and the F1-score is 0.826. 
From this result, we confirmed that predicting the labels by majority 
voting is more difficult than predicting the labels of an individual worker. 
Also, the accuracy of a classifier that outputs the correct labels is 
too low for practical use. 
Table 2: Workers’ accuracy 


Worker | Category | # votes | $R(w_i)$ | $A(w_i)$ | $F(w_i)$ 
$w_1$  | TP       |   4,128 |    0.787 |    0.793 |    0.922 
$w_2$  | TN       |   3,262 |    0.253 |    0.101 |    0.227 
$w_3$  | FP       |   5,809 |    0.500 |    0.733 |    0.778 
$w_4$  | FN       |     830 |    0.804 |    0.742 |    0.785 


4.3 Experiment 2: Number of votes for accurate prediction of worker quality 


In this experiment, we determine how many votes we need to predict 
worker quality accurately. We pick four typical workers from the 
following four categories: TP) both the F1-score $F(w_i)$ and the acceptance 
rate $R(w_i)$ are high; TN) both $F(w_i)$ and $R(w_i)$ are low; FP) $F(w_i)$ 
is high and $R(w_i)$ is low; FN) $F(w_i)$ is low and $R(w_i)$ is high. Then, 
we pick 50, 100, 150, ... votes, perform 10-fold 
cross-validation on them, and calculate the accuracy scores and F1-scores as GoF 
scores. As shown in Fig. 3b, no worker is in the group FN. Therefore, 
we choose the worker who has the lowest acceptance rate with the 
highest F1-score. Table 2 shows the acceptance rate, accuracy, and 
F1-score of the workers we chose. 

Fig. 4 shows how the accuracies and F1-scores 
of each worker vary with the number of votes. From these figures, we can observe that the accuracy 
scores and F1 scores of all workers are high if the number of votes 
is small. However, when the number of votes exceeds 200, the 
scores of low-quality workers become low. Therefore, workers 
should process more than 200 HITs before accurate GoF scores can be calculated. 
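
The sweep over 50, 100, 150, ... votes can be sketched as a loop over growing prefixes of one worker's votes. This is a sketch under assumptions: it takes the first n votes in the order they were submitted (the text does not say how the votes are picked), and the scoring function is passed in as a parameter, so any GoF computation can be plugged in. The function name `scores_by_vote_count` is hypothetical.

```python
def scores_by_vote_count(votes, score_fn, step=50):
    """Score the first 50, 100, 150, ... votes of one worker.

    votes:    list of (tweet text, label) pairs in submission order
    score_fn: callable that maps a list of votes to a score, for
              example a cross-validated accuracy or F1 score
    Returns a list of (number of votes, score) pairs.
    """
    results = []
    for n in range(step, len(votes) + 1, step):
        results.append((n, score_fn(votes[:n])))
    return results


if __name__ == "__main__":
    # Toy scoring function: share of votes carrying the label "l1".
    def share_l1(subset):
        return sum(label == "l1" for _, label in subset) / len(subset)

    toy_votes = [(f"tweet {i}", "l1" if i % 2 else "l5") for i in range(220)]
    for n, score in scores_by_vote_count(toy_votes, share_l1):
        print(n, score)
```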

However, our proposed system incorrectly identifies several low-quality 
workers as high-quality, like worker $w_3$. These workers 
appear in the upper-left part of the graphs in Fig. 3. The number of 
workers in this region is relatively small. We checked the outcomes of 
these workers in detail and found that they process HITs in 
good faith but misunderstand the instructions of the task. In our 
task, as described in Section 3.1, if a tweet includes 
both a fact and an opinion, the workers should select $l_3$ (the tweet 
includes opinions of the Twitter user who posted it). However, 
these workers select $l_1$ (the tweet includes only a fact). Therefore, 
these workers should be considered low-quality. Nevertheless, requesters 
should handle these workers differently from workers who select 
options randomly. We should construct a system for 
classifying these different types of low-quality workers. 


Figure 4: Time-series variation of accuracy and F1-score vs. number of votes 


5 CONCLUSION 


In this paper, we proposed a method for predicting the quality of 
workers using the GoF (goodness-of-fit) scores of machine learning 
models. We consider accuracy scores and F1 scores as the goodness- 
of-fit scores. We discover whether there is a correlation between 
the goodness-of-fit scores and the acceptance rate of the workers. 
We did the crowdsourcing task and measured the goodness-of-fit 
scores. 
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We performed an empirical evaluation on detecting high-quality 
and low-quality workers in a Twitter annotation task. From
Experiment 1, we found that both the accuracy score and the F1 score
are correlated with worker quality, and that the F1 score measures
worker quality better than the accuracy score. From Experiment 2, we
found that at least 200 votes are required to calculate accurate
GoF scores.
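
As a rough illustration of the two scores, the following sketch computes accuracy and a binary F1 score from two label sequences. It assumes that the scores compare the labels a worker gave with the labels a model predicts for the same tweets; the function name gof_scores, the label names, and the sample data are hypothetical and are not taken from our implementation.

```python
def gof_scores(worker_labels, model_predictions, positive="opinion"):
    """Return (accuracy, F1) for two equally long label sequences.

    Hypothetical helper: `positive` names the class treated as positive
    when computing the binary F1 score.
    """
    if len(worker_labels) != len(model_predictions):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(worker_labels, model_predictions))
    correct = sum(1 for w, m in pairs if w == m)
    true_pos = sum(1 for w, m in pairs if w == positive and m == positive)
    false_pos = sum(1 for w, m in pairs if w != positive and m == positive)
    false_neg = sum(1 for w, m in pairs if w == positive and m != positive)
    accuracy = correct / len(pairs) if pairs else 0.0
    denominator = 2 * true_pos + false_pos + false_neg
    f1 = 2 * true_pos / denominator if denominator else 0.0
    return accuracy, f1


# Made-up labels for five tweets (illustration only).
worker = ["opinion", "fact", "opinion", "fact", "opinion"]
model = ["opinion", "fact", "fact", "fact", "opinion"]
accuracy, f1 = gof_scores(worker, model)
```

With fewer votes, a single disagreement shifts both scores considerably, which is consistent with the need for a minimum number of votes before the scores stabilize.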

In future work, we plan to integrate our proposed worker quality
measurement system into a crowdsourcing platform. Unlike the
acceptance rate, F1 scores can be calculated by the platform in a
short time. Therefore, the platform can quickly exclude low-quality
workers, so that requesters save on wages and obtain higher-quality
outcomes.

Moreover, we should integrate our method with the other worker
quality prediction methods based on worker behaviors, introduced
in Section 2. Our method and these methods are complementary. In
our method, several low-quality workers are misclassified as
high-quality workers. When we combine several methods, the accuracy
of this classification will increase.

In our experiment, we dealt with a worker evaluation system based
on majority voting. We used this system because we do not yet
understand how the GoF scores are helpful. We could develop another
point system using GoF scores. However, such a system can be a black
box: requesters cannot understand why workers are considered
low-quality even if the workers' outcomes are of high quality. To
solve this problem, we should develop a method to interpret this
black box.

ABSTRACT


SuperSQL is an extended SQL language that brings out a rich layout
presentation of a relational database with a single query. This
paper proposes SSstory, a storytelling system in a 3D data space
created from a relational database. SSstory uses SuperSQL and Unity
to generate a data video and add cinematic direction to it. Without
learning a special authoring tool, users can easily create data
videos with a small amount of code.

1 INTRODUCTION


Nowadays, with massive amounts of data being created, conveying the
insights in data to an audience is becoming critical. Data
storytelling, which presents data insights as a story, has therefore
become a topic in data visualization.

Previous research[1] named SSQL4.5DVS uses Unity[2], a game
development platform, to implement a system that can generate
objects with data in a 3D space. With SSQL4.5DVS, even users
without knowledge of Unity can generate visualized scenes in the
3D space.

However, SSQL4.5DVS is only a visualization tool and does not
support data storytelling. Authoring tools for data storytelling
exist, but they take time to learn or require specialized knowledge
and technology.

This paper proposes data storytelling systems that use SuperSQL,
an extended SQL language, and Unity to make a data video in the
3D space.


2 RELATED WORK 


Several papers inspired our work. Edward et al.[3] look at the
construction of data storytelling and analyze its elements; this
paper draws on their structure of data storytelling. Fereshteh Amini
et al.[4] analyze 50 professionally designed data videos, extracting
and exposing their most salient constituents. Fereshteh Amini et
al.[5] also develop a tool named DataClips, which generates data
videos from data clips; users can do so easily without a video
authoring tool like Adobe AfterEffects. Ren[8] looks at the
importance of annotation in data storytelling and develops a
specialized data storytelling tool for annotation. Edward et al.[6]
study the elements of stories and summarize design patterns for data
stories from the results. Lyu et al.[7] present a communication
model of storytelling and introduce a visualization work, a Chinese
painting, that uses big meteorological data from China. Amin
Beheshti et al.[9] combine an intelligent DataLake technique and
data analytics algorithms to implement an interactive storytelling
dashboard for social data understanding. Shi et al.[10] develop a
tool that automatically generates data stories from Google
Spreadsheet.

However, these approaches take considerable time to generate data
stories. Also, most of them present the data story as a 2D graph;
there is hardly any approach for data storytelling in 3D space.


3  SSstory 


This paper proposes an approach that combines the StoryGenerator
and the StoryEditor. The StoryGenerator uses SuperSQL and Unity to
visualize the data from the database, and the StoryEditor adds
story elements to the visualization. The data story is presented as
a data video, that is, the data is shown as a video or animation.


3.1 High-dimensional visualization object 


A high-dimensional visualization object can show high-dimensional
information. For example, an object like a human face can map
high-dimensional information to the height of the nose, the width
of the mouth, and the color of the eyes.

We can create a high-dimensional visualization world with several
high-dimensional visualization objects. The advantage of the
high-dimensional visualization object and high-dimensional
visualization is that the audience can choose the information they
want to know.
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Figure 1: Asset of Unity 


Figure 1 shows a result that uses a Unity asset to create a world
resembling a farm[14]. In this world, the strength of the wind, the
strength of the sunlight, the number of animals, and the amount of
grass can each carry information from the database.


3.2 About SuperSQL 


SuperSQL is an extension of SQL that structures the output of a
relational database query and enables various layout expressions.
It has been developed at the Toyama Laboratory, Keio
University[11][12]. A SuperSQL query replaces the SQL SELECT clause
with a GENERATE clause of the form GENERATE <media> <TFE>, where
<media> is the output medium, such as HTML or PDF, and <TFE> is a
Target Form Expression. A TFE is an extension of the target list: an
expression with layout-specifying operators such as connectors and
iterators.
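
To make the clause structure concrete, the following sketch splits a query of the form GENERATE <media> <TFE> FROM ... into its three parts. It is only an illustration under simplifying assumptions, not the SuperSQL parser: the function name split_generate_query and the sample query over a table named employee are hypothetical, and a string literal containing the word FROM would defeat this simple pattern.

```python
import re

# Hypothetical pattern: media is one token, the TFE runs up to the first FROM.
_GENERATE_PATTERN = re.compile(
    r"^\s*GENERATE\s+(?P<media>\S+)\s+(?P<tfe>.*?)\s+FROM\s+(?P<rest>.*)$",
    re.IGNORECASE | re.DOTALL,
)


def split_generate_query(query):
    """Return (media, tfe, from_clause) for a GENERATE query."""
    match = _GENERATE_PATTERN.match(query)
    if match is None:
        raise ValueError("not a GENERATE query")
    return (
        match.group("media"),
        match.group("tfe").strip(),
        match.group("rest").strip(),
    )


media, tfe, from_clause = split_generate_query(
    "GENERATE HTML [e.name, e.salary]! FROM employee e"
)
```

Here the TFE [e.name, e.salary]! carries the layout information, while the remainder is ordinary SQL.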


3.3 Architecture


Figure 2 shows the architecture of SSstory. The system consists of
the StoryGenerator and the StoryEditor. The StoryGenerator consists
of the Parser, which separates the SQL query from the layout
presentation; the DataConstructor, which lays out the data using the
query result and the structure information of the tables; and the
CodeGenerator, which generates an XML file from the structured data.
In the StoryEditor, the Timeline and Cinemachine packages are used
to improve the quality of the data video. The system aims to
generate 80% of the material with the StoryGenerator and to finish
the video with the remaining 20% of adjustment in the StoryEditor.

Next, we describe the workflow of this system. First, the user
stores the data to be visualized in the database. When the SuperSQL
query is executed, an XML file and a C# file are generated based on
the data and the query contents. The XML file describes the objects,
the elements assigned to each object, and the layout. The C# file
reads the XML file, creates the objects, gives them color and
animation, and determines the position of each object from the
layout information. By importing these two files into Unity and
executing them, we can realize the data visualization specified in
the query. When assets are used, they must also be imported into
Unity at the same time. Then, in the StoryEditor, Cinemachine and
the AnimatorController create the animation elements from the XML
file. Finally, the user can use the Timeline to choose the active
camera and adjust the video sequence. Figure 3 shows the Timeline
used by the StoryEditor.

Figure 2: Architecture of SuperSQL


3.4 Object Generation Function 


This system uses functions to create objects whose properties, such
as size, can be assigned through arguments. The system supports
three types of object generation functions.
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Figure 3: Timeline of Unity 


Figure 4: Generation example of object(torus) 




Figure 5: Generation example of asset(panda) 


3.4.1 Primitive Generation Function. The user can create primitive
objects such as a cube, sphere, torus, cuboid, or pyramid with this
function.


object(object, argument1, argument2)


The following shows a description example of the object function
for generating an object (torus); Figure 4 shows its generation
result.


object('torus', r1, r2)


A 3D text object can also be generated through the primitive
generation function.


object('text', text, size)


3.4.2 Asset Generation Function. The asset generation function uses
an asset, an existing object created by the user.


asset(asset_name, size)


As shown in Figure 5, the user can generate an asset called Panda
through the following description.


asset('Panda', size)


IDEAS 2021: the 25th anniversary 


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 


3.4.3 Random Generation Function. The random generation function
generates objects from a folder. All the assets in the folder are
put in an array and wait to be generated. By default, assets are
generated in the X-Y plane following a normal distribution, and the
maximum coordinate depends on the scene asset.


random(asset_folder, number, min_size, max_size)


For example, this query can generate all assets in the Animal
folder according to the arguments. To generate them in the whole
X-Y-Z space rather than the X-Y plane, the user can add a
decoration (@) to the function.


random('Animal', number, min_size, max_size)@{space = 'X-Y-Z'}
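
The placement behavior can be sketched as follows. This is an illustration under stated assumptions rather than the generated Unity code: the function random_placements, the max_coord parameter standing in for the bound given by the scene asset, the standard deviation of max_coord / 3, the clamping to the bound, and the uniform choice of asset and size are all hypothetical.

```python
import random


def random_placements(assets, number, min_size, max_size,
                      max_coord=10.0, space="X-Y", seed=None):
    """Return `number` placements with an asset name, a size, and a position."""
    rng = random.Random(seed)

    def coord():
        # Normal distribution centred on the origin, clamped to the scene bound.
        value = rng.gauss(0.0, max_coord / 3.0)
        return max(-max_coord, min(max_coord, value))

    placements = []
    for _ in range(number):
        x, y = coord(), coord()
        # The third axis is only randomized when the decoration asks for it.
        z = coord() if space == "X-Y-Z" else 0.0
        placements.append({
            "asset": rng.choice(assets),
            "size": rng.uniform(min_size, max_size),
            "position": (x, y, z),
        })
    return placements


flat = random_placements(["Deer", "Panda"], 5, 0.5, 2.0, seed=1)
volume = random_placements(["Deer", "Panda"], 5, 0.5, 2.0,
                           space="X-Y-Z", seed=1)
```

The space argument plays the role of the @{space = 'X-Y-Z'} decoration: without it, every generated object keeps a third coordinate of zero.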


3.5 Connector 


Table 1 shows the operators supported by SuperSQL. A connector is
an operator that specifies the direction (dimension) in which the
data obtained from the database is combined. There are the
following four kinds. The character used to represent each
connector in a SuperSQL query is noted in parentheses.


• Horizontal Connector (,).
Two pieces of data connected by this connector are arranged
next to each other horizontally (along the x-axis). In a
flat document, this means the two pieces of data are shown
on the same line. In a 3D document, this connector acts as
shown below.
ex: obj1, obj2

• Vertical Connector (!).
Two pieces of data connected by this connector are arranged
next to each other vertically (along the y-axis). In a flat
document, this means the two pieces of data are shown
on two separate lines. In a 3D document, this connector acts
as shown below.
ex: obj1! obj2

• Depth Connector (%).
Two pieces of data connected by this connector are arranged
next to each other in depth (along the z-axis). In a flat
document, this means the first piece of data links to the
second piece of data. In a 3D document, this connector acts as
shown below.
ex: obj1% obj2

• Time Connector (#).
Two pieces of data connected by this connector are arranged
next to each other on the time axis and are shown at a constant
time interval. The following shows an example of the time
connector.
ex: [country, city]##[building]#

[Figure: example output with labels Italy, Rome, Pantheon, Colosseum]
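
The effect of the four connectors can be summarized in a small sketch that treats a placement as a tuple (x, y, z, time). The names AXIS_INDEX and place_second, the unit step, and the choice of the positive direction on each axis are assumptions made for illustration only.

```python
# Index of the axis that each connector character advances: x, y, z, time.
AXIS_INDEX = {",": 0, "!": 1, "%": 2, "#": 3}


def place_second(connector, first=(0.0, 0.0, 0.0, 0.0), step=1.0):
    """Return the (x, y, z, time) placement of the second operand."""
    index = AXIS_INDEX[connector]
    second = list(first)
    # Only the axis selected by the connector changes.
    second[index] += step
    return tuple(second)


horizontal = place_second(",")   # obj1, obj2
vertical = place_second("!")     # obj1! obj2
depth = place_second("%")        # obj1% obj2
later = place_second("#")        # shown one time interval later
```

Each connector therefore differs only in the axis it advances, which is why the same expression can be laid out in space or played back in time.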


3.6 Grouper 


The second type of operator available in SuperSQL is the grouper. 
When a grouper is used, the attribute or group of attributes affected 
by the operator is repeated in the document for each tuple retrieved 
from the database by the query. The following groupers are available 
in this system. 


• Horizontal Grouper ([ ],).
An object or group of objects is added to the document for
each tuple in the relation retrieved by the query; each object
is arranged horizontally. In our 3D system, this produces the
pattern shown below.
ex: [Obj],

• Vertical Grouper ([ ]!).
An object or group of objects is added to the document for
each tuple in the relation retrieved by the query; each object
is arranged vertically. In our 3D system, this produces the
pattern shown below.
ex: [Obj]!

• Depth Grouper ([ ]%).
An object or group of objects is added to the document for
each tuple in the relation retrieved by the query; each object
is arranged in depth. In our 3D system, this produces the
pattern shown below.
ex: [Obj]%

• Timeline Grouper ([ ]#).
The timeline grouper arranges scenes on a timeline and
presents the data as a data video.
ex: [country, city]#

[Figure: example scenes labeled America / New York, Japan / Tokyo, America / Chicago, Italy / Rome]

• Slideshow Grouper ([ ]&).
The slideshow grouper arranges scenes like a slideshow with
UI buttons. Clicking a button switches to another scene.
Section 4.3 shows an example of the slideshow repeat.
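
The repetition performed by a grouper can be sketched as follows: one placement is produced per tuple, and the grouper character decides which field advances. The names GROUPER_FIELD and apply_grouper, the dictionary layout, and the step size are hypothetical; the sample tuples reuse the country and city labels of the timeline example.

```python
# Field advanced by each grouper character (hypothetical representation).
GROUPER_FIELD = {",": "x", "!": "y", "%": "z", "#": "time", "&": "slide"}


def apply_grouper(grouper, rows, step=1.0):
    """Return one placement per tuple retrieved by the query."""
    field = GROUPER_FIELD[grouper]
    placements = []
    for i, row in enumerate(rows):
        placement = {"x": 0.0, "y": 0.0, "z": 0.0,
                     "time": 0.0, "slide": 0, "data": row}
        # Slides are numbered; the other fields advance by a fixed step.
        placement[field] = i if field == "slide" else i * step
        placements.append(placement)
    return placements


rows = [("America", "New York"), ("Japan", "Tokyo"), ("Italy", "Rome")]
timeline = apply_grouper("#", rows, step=2.0)   # [country, city]#
slides = apply_grouper("&", rows)               # [country, city]&
```

With the timeline grouper the tuples become successive scenes of a data video, whereas with the slideshow grouper each tuple becomes a scene selected by a button.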


Table 1: Operator in SuperSQL 


tfe1, tfe2 | Horizontal connector
tfe1! tfe2 | Vertical connector
tfe1% tfe2 | Depth connector
[tfe], | Horizontal grouper
[tfe]! | Vertical grouper
[tfe]% | Depth grouper
[tfe]# | Timeline grouper


3.7 Environmental Function


The environmental function assigns information to an environmental
object. An environmental object is an object that exists only once
in the scene. For example, there is only one sun in the scene, so
it is an environmental object. Through the following function, the
user can vary the strength of sunlight along the timeline.


env(object, component, attribute, value)


When the environmental function is executed, an empty GameObject
called EnvController is generated in Unity. The script of the
EnvController fetches the environmental object and its component
and assigns the value to the attribute.


env('Sunlight', 'Light', 'intensity', value)


This function assigns a value to the intensity of the Sunlight
object to change the lighting in the scene.
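
The lookup performed by the EnvController can be sketched outside Unity as follows. The scene is modeled as nested dictionaries (object name, then component name, then attributes); this representation and the Python class below are assumptions for illustration and do not reproduce the generated C# script.

```python
class EnvController:
    """Hypothetical stand-in for the EnvController script."""

    def __init__(self, scene):
        # scene: object name -> component name -> attribute dictionary
        self.scene = scene

    def env(self, obj, component, attribute, value):
        """Fetch the object and its component, then assign the attribute."""
        try:
            target = self.scene[obj][component]
        except KeyError as exc:
            raise LookupError(
                f"no component {component!r} on object {obj!r}"
            ) from exc
        target[attribute] = value
        return target


scene = {"Sunlight": {"Light": {"intensity": 1.0}}}
controller = EnvController(scene)
controller.env("Sunlight", "Light", "intensity", 0.4)
```

Because the object is looked up by name, the sketch relies on the property stated above that an environmental object exists only once in the scene.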
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Figure 6: Example of annotation 


3.8 2D Element Function 


This system supports two types of 2D element functions. A 2D
element is an object that exists in the canvas. Connectors do not
influence its coordinates, so it can be written anywhere in the
query. A 2D element can also be repeated by a timeline grouper to
generate an animation.


3.8.1 Annotation Function. The annotation function is used to
create an annotation in the canvas.


annotation(type, text)


The first argument gives the type of annotation. There are some
presets for annotations, such as 'text_bottom', which means a text
box at the bottom of the canvas.


annotation('text_bottom', 'some annotations')


Figure 6 shows the result of this query. You can also use a
decoration to set details of the annotation such as width, height,
font size, texture, and highlight.


annotation('text_bottom', 'some annotations')@{font_size = '20'}


Fig. will show the result of this query and its decoration. Another 
valuable decoration for the annotation function is setting the start 
time to achieve a better presentation for the data video. 


annotation('text_bottom', 'some annotations')@{start_time = '5'}


The annotation will appear after 5 seconds, according to the 
decoration. Fig shows a combination of three annotation functions. 


[annotation('text_bottom', data)]#


Combined with the timeline grouper, the annotation function
produces an animated annotation that shows the data.


3.8.2 Chart Function. The chart function is used to create charts
in the canvas. The first argument gives the type of chart; some
presets of charts have been prepared. The data of the chart is
assigned by the second and third arguments.


chart(type, data1, data2)


Figure 7: Graph And Chart 


We use the Unity asset called 'Graph and Chart'[13] to implement
this function. The table shows the supported types of chart in the
chart function. Figure 7 shows a sample of the Graph And Chart asset.


[chart(type, data1, data2)], [chart(type, data1, data2)]!


Both the horizontal grouper ([],) and the vertical grouper ([]!)
repeat the data and combine it into a whole chart. The graph shows
the grouper used for the chart function.


[chart(type, data1, data2)]#


The timeline grouper turns the chart into an animated chart.

Like the annotation function, the chart function accepts
decorations that adjust the chart, such as width, height, axis
label, and approximate curve.


chart(type, data1, data2)@{approximate_curve = 'linear'}


Two charts can be combined into one chart by using the '+' mark.


chart('Bar', data1, data2) + chart('LineGraph', data3, data4)


3.9 Optional Function


An optional function adds an extra attribute to an object
generation function. This section introduces the optional
functions and shows some examples.


function(<TFE>, parameter)


The first argument <TFE> is the object generation function with its
connector and grouper. The second argument gives the extra
attribute assigned to the TFE.


• Hop Function
The hop function makes an object jump by specifying the
velocity, the vertex, and the axis direction. The following
shows a description example of the hop function.

hop(<TFE>, velocity, vertex, axis)

• Pulse Function
The pulse function stretches the object by specifying the
magnification and the velocity. The following shows a
description example of the pulse function.

pulse(<TFE>, magnification, velocity)

• Rotate Function
The rotate function rotates the object by specifying the
rotation rate for the three axes. The following shows a
description example of the rotate function.

rotate(<TFE>, X_rate, Y_rate, Z_rate)

• Color Function
The color function gives an object a color specified by a
character string. Available colors include "red", "blue",
"green", "black", "clear", "cyan", "gray", "yellow", "white",
and "magenta". The following shows a description example of
the color function.

color(<TFE>, color_name)

• Position Function
The position function changes the absolute coordinate of the
object. It takes the origin (0, 0, 0) of the 3D space as the
center and updates the coordinate after generation.

position(<TFE>, x_coordinate, y_coordinate, z_coordinate)

• Move Function
The move function changes the relative coordinate of the
object. Unlike other functions, move functions can be nested:
if a move function is described inside another move function,
the coordinate is calculated through all the move functions.

move(<TFE>, x_coordinate, y_coordinate, z_coordinate)

• Optional Annotation Function
The optional annotation function assigns annotations to an
object. Unlike the annotation function in Section 3.8.1, which
generates an annotation in the canvas, the optional annotation
function adds annotations to an object. The user can read the
annotation by clicking the object in the game mode.

optional_annotation(<TFE>, text)

• Camera Function
Camera and cinematic direction play an important role in data
storytelling. This system supports the camera function to
generate a camera in the scene. The types of camera include
the follow camera, which follows an object; the fixed camera,
which stays in a stable position; and the FPS camera, which
shows the first-person view of an object.

camera(<TFE>, type)

Figure 8 shows a follow camera for a deer asset.

camera(asset('Deer', 1), 'Follow')

Moreover, the camera function can be repeated by a grouper
to generate a camera for each object. Decorations can also
be used to change attributes of the camera, such as lens,
damping, and coordinate.
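
The difference between the position function and nested move functions can be illustrated with a short sketch. It assumes that calls are applied from the innermost function outwards; the helper resolve_position and the example coordinates are hypothetical.

```python
def resolve_position(generated, calls):
    """Apply position/move calls, innermost first, to a generated coordinate.

    `generated` is the (x, y, z) coordinate given by the layout, and `calls`
    is a list of (function_name, (a, b, c)) pairs.
    """
    x, y, z = generated
    for name, (a, b, c) in calls:
        if name == "move":
            # Relative: every nested move adds its offset.
            x, y, z = x + a, y + b, z + c
        elif name == "position":
            # Absolute: the coordinate is replaced after generation.
            x, y, z = a, b, c
        else:
            raise ValueError(f"unsupported function: {name}")
    return (x, y, z)


# Corresponds to move(move(<TFE>, 1, 0, 0), 0, 2, 0) for an object
# that the layout generated at (5, 0, 0).
nested = resolve_position((5.0, 0.0, 0.0),
                          [("move", (1, 0, 0)), ("move", (0, 2, 0))])
absolute = resolve_position((5.0, 0.0, 0.0), [("position", (1, 1, 1))])
```

The nested moves accumulate to an offset of (1, 2, 0), while the position function ignores the coordinate produced by the layout.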
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Figure 8: Example of follow camera 


3.10 Query 


In general, the query of storytelling will be the following expression. 


GENERATE unity_dv
Scene(scene_name)
[option to object(several functions(...), necessary parameter) (grouper) (connector)
(environmental function)
(2D element)
FROM table
WHERE condition


(First element) Describe the scene used in this system.

(Second element) SSstory supports several functions that assign
information to objects. Choose which element to assign to the
objects through the function name, then describe the object
generation function or another function in the first argument. The
arguments must be described in the correct order. A connector is
used here to connect another object. Every object follows the same
expression rule.

(Third element) Assign information to environmental objects
through the environmental function. There is only one environmental
function for each environmental object in a scene.

(Fourth element) Describe the 2D elements in the canvas. The 3D
objects in the second element do not influence their coordinates.

The first and second elements are required; the third and fourth
elements are optional.


3.11 Implementation 


This section will show the implementation of StoryGenerator. 


3.11.1 XML File Generation. The XML file is created by SuperSQL
from the '*.ssql' file.

We now explain Algorithm 1. First, the tree structure data and the
layout formula are input through the query in the '*.ssql' file. If
the CodeGenerator in SuperSQL reads the media name Unity_dv, it
adds an XML tag to the output. Then, SuperSQL reads the grouper and
generates a grouper node with the attribute g1(,), g2(!), g3(%),
g4(#), or g5(&). Next, it reads the child node of the grouper and
generates a connector node with the attribute c1(,), c2(!), c3(%),
or c4(#). The objects around the connector become children of the
connector node.
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Algorithm 1 XML File Generation 


Require: tree structure data, layout formula
Ensure: XML file
 1: add XML tag
 2: for all operator do
 3:   if there is the grouper then
 4:     create grouper element
 5:   end if
 6:   if there is the connector then
 7:     create connector element
 8:   end if
 9:   while the function has function in argument do
10:     save information of current function
11:     read next function
12:     if object generation function then
13:       create object element
14:       add saved optional function information as child node of object
15:     end if
16:   end while
17: end for
18: return XML file


Then SuperSQL reads the functions of the query. If a function is
an object generation function, it creates the object or asset node.
It then reads the optional functions and adds their information to
the object or asset node until the last function.
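
The nesting produced by this procedure can be sketched with Python's standard xml.etree.ElementTree module. Only the grouper attributes g1 to g5 and the connector attributes c1 to c4 come from the description above; the element names (unity_dv, grouper, connector, object, optional), the attribute names, and the function build_xml are hypothetical and do not reproduce the actual output format of the CodeGenerator.

```python
import xml.etree.ElementTree as ET

GROUPER_ATTR = {",": "g1", "!": "g2", "%": "g3", "#": "g4", "&": "g5"}
CONNECTOR_ATTR = {",": "c1", "!": "c2", "%": "c3", "#": "c4"}


def build_xml(grouper, connector, objects):
    """Build grouper > connector > object > optional-function nodes.

    `objects` is a list of (generation_function, args, optional_functions),
    where optional_functions is a list of (function_name, args).
    """
    root = ET.Element("unity_dv")
    group = ET.SubElement(root, "grouper", type=GROUPER_ATTR[grouper])
    conn = ET.SubElement(group, "connector", type=CONNECTOR_ATTR[connector])
    for name, args, optional in objects:
        # Objects around the connector become children of the connector node.
        obj = ET.SubElement(conn, "object", function=name, args=",".join(args))
        for opt_name, opt_args in optional:
            # Saved optional function information becomes a child of the object.
            ET.SubElement(obj, "optional", function=opt_name,
                          args=",".join(opt_args))
    return root


root = build_xml("#", ",", [
    ("asset", ["'Deer'", "1"], [("color", ["'red'"])]),
    ("object", ["'torus'", "r1", "r2"], []),
])
xml_text = ET.tostring(root, encoding="unicode")
```

The resulting tree mirrors the order of Algorithm 1: the grouper wraps the connector, the connector wraps the objects, and each object carries its optional functions as child nodes.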


3.12 Object Generation 


Algorithm 2 shows how objects are created in Unity. First, the XML
reader reads all elements of the XML file. If there is a grouper
element, it generates a grouper object, which is a wrapper for the
child objects, and reads its child elements. If there is a
connector element, it generates a connector object, which is also a
wrapper for the child objects, and reads its child elements. When
it reads an object generation element, it instantiates an object
from the information in the object element. If the object element
has optional functions as children, it adds their information to
the object until the last child.
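
The traversal can be sketched in Python as a depth-first walk over the XML tree. The XML vocabulary used here (grouper, connector, object, and optional elements with type, function, and args attributes) is a hypothetical one chosen for illustration, and the dictionaries stand in for the objects that the C# script instantiates in Unity.

```python
import xml.etree.ElementTree as ET


def instantiate(element, parent_path=""):
    """Walk the tree depth-first and return the generated objects."""
    objects = []
    for child in element:
        if child.tag in ("grouper", "connector"):
            # Wrapper objects only group their children.
            wrapper_path = f"{parent_path}/{child.tag}:{child.get('type')}"
            objects.extend(instantiate(child, wrapper_path))
        elif child.tag == "object":
            obj = {
                "function": child.get("function"),
                "parent": parent_path,
                "optional": {},
            }
            # Optional functions are children of the object element.
            for opt in child.findall("optional"):
                obj["optional"][opt.get("function")] = opt.get("args")
            objects.append(obj)
    return objects


XML_TEXT = """<unity_dv><grouper type="g4"><connector type="c1">
<object function="asset" args="'Deer',1">
<optional function="color" args="'red'"/>
</object>
<object function="object" args="'torus',r1,r2"/>
</connector></grouper></unity_dv>"""

generated = instantiate(ET.fromstring(XML_TEXT))
```

Each generated object records the wrappers above it, which is the information needed to compute its position from the layout.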


4 USE CASE 


This section shows a use case of storytelling with Unity and
SuperSQL.


4.1 Sample Data 


The use case uses weather data with nine dimensions for Tokyo in
2020[16] and a farm asset[14]. Table 2 shows the correspondence
between data information and asset properties. The environment of
the scene is determined by environmental functions, and Table 3
shows several environmental functions supported by the system.




Algorithm 2 Object Generation 


Require: XML file
Ensure: Object with function information
 1: read XML file
 2: for all XML element do
 3:   if there is the grouper element then
 4:     create grouper object
 5:   end if
 6:   if there is the connector then
 7:     create connector object
 8:   end if
 9:   while the element has child do
10:     if object generation function then
11:       create object
12:     end if
13:     if optional function then
14:       add optional function information to the object
15:     end if
16:     if environment function then
17:       generate environment manager
18:       find the environment object and assign value
19:     end if
20:   end while
21: end for
22: return Object


Table 2: Corresponding relationship of information and asset


Data information Asset property 


air pressure strength of lighting 


max precipitation size of grass 


average precipitation 
average temperature 
min temperature 


amount of grass 
number of animals 
max temperature max size of animals 
humidity velocity of the waterfall 
number of birds 
(higher air pressure more bird) 


time of sunlight 


Table 3: Environmental Function 


Function | Asset property
env('waterfall', 'rotate', 'velocity', value) | velocity of waterfall
env('windmill', 'rotate', 'velocity', value) | velocity of windmill
env('sunlight', 'light', 'intensity', value) | strength of sunlight
env('grass', 'transform', 'scale', value) | size of grass
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4.2 Query Example 


We execute the following query; Section 4.3 shows its result.
-QueryExample


GENERATE Unity_dv
scene(farm),
[random('animal', w.tempera_average, w.tempera_min, w.tempera_max),
asset('bird', w.pressure),
optional_annotation(env('waterfall', 'ParticleSystem', 'speed', w.humidity),
    'The humidity of' || w.month || 'is' || w.humidity),
optional_annotation(env('Sun', 'Light', 'intensity', w.sunlight_time),
    'The sunlight time of' || w.month || 'is' || w.sunlight_time),
optional_annotation(env('grass', 'ParticleSystem', 'max_size', w.precip_max),
    'The max of precipitation of' || w.month || 'is' || w.precip_max),
optional_annotation(env('grass', 'ParticleSystem', 'max_particles', w.precip_total),
    'The total of precipitation of' || w.month || 'is' || w.precip_total),
optional_annotation(env('wind', 'rotate', 'velocity', w.wind),
    'The strength of Wind of' || w.month || 'is' || w.wind),
w.month
]&
from weather_tokyo w


The schema used in this query is as follows.
-weather(id, pressure, precip_max, precip_total, tempera_average,
tempera_min, tempera_max, humidity, wind, sunlight_time);


4.3 Execution Result 


This section shows the result of execution. Figure 9 shows the
dataset of August, and Figure 10 shows the dataset of December. The
animals in August are bigger than those in December, and the
sunlight is brighter.


Figure 9: Dataset of August 
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Figure 10: Dataset of December 


Figure 11: Example of slideshow repeat


• Annotation
Annotations can be assigned to an object through the optional
annotation function, as shown in Figure 13.

• Slideshow repeat
Figure 11 shows an example of the slideshow repeat. By
grouping by month, a scene and a corresponding UI button are
generated for each month.


[Figure text: "No one go out in May!"]


Figure 12: Annotation for scene
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The strength of wind of May is 
3.1m/s 


Figure 13: Annotation for object 


4.4 Another Presentation


SuperSQL can alter the presentation easily just by rewriting the
query. For this use case, we change the corresponding relationships
of air pressure and sunlight time. In the new presentation, higher
air pressure produces brighter light, and longer sunlight time
brings more birds.


Figure 14: Dataset of August after 


Figure 15: Dataset of December after 


Figures 14 and 15 show the new presentation of the same dataset.


December appears brighter than August because of the higher air
pressure. In SSstory, creators can revise their presentation
quickly to convey the intended message.


4.5 Another Dataset 


SSstory can use the same asset to present other datasets. As an
alternative, we use the Covid-19 data of 2020 in Tokyo, Japan.
Table 4 shows the corresponding relationship in the new dataset.


Table 4: Corresponding relationship in Covid-19 dataset 


Data information | Asset property
newly infected cases | strength of lighting (less people more bright)
total severe cases | size of grass
newly death cases | amount of grass
total vaccine inoculations | number of birds
newly vaccine inoculations | max size of birds

-QueryExample2


Generate Unity_dv
scene('Farm')
,random('Bird', c.total_vaccine/100000, 0.1, c.newly_vaccine/500000 + 0.5)
,env('Directional Light', 'Light', 'intensity', (8000 - c.newly_infection)/3000)
,env('GrassA', 'ParticleSystem', 'startSize', c.newly_death/2)
,env('GrassA', 'ParticleSystem', 'maxParticles', c.total_severe*2)
!annotation('textbottom', c.date || ',' || 'newly infection' || ':' || c.newly_infection)
from covid19 c
where c.date = '2021-3-22'
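
To show how the arithmetic in QueryExample2 turns one row of data into scene parameters, the following sketch evaluates the same expressions in Python. The constants are those of the query; the function name, the dictionary keys of the result, and above all the sample row are hypothetical, and the numbers in the row are made up for illustration rather than taken from the Covid-19 dataset.

```python
def covid_scene_parameters(row):
    """Evaluate the expressions of QueryExample2 for one row of covid19."""
    return {
        "bird_count": row["total_vaccine"] / 100000,
        "bird_min_size": 0.1,
        "bird_max_size": row["newly_vaccine"] / 500000 + 0.5,
        # Fewer new infections give a brighter scene.
        "light_intensity": (8000 - row["newly_infection"]) / 3000,
        "grass_start_size": row["newly_death"] / 2,
        "grass_max_particles": row["total_severe"] * 2,
    }


# Made-up values, not real data.
sample_row = {
    "total_vaccine": 500000,
    "newly_vaccine": 250000,
    "newly_infection": 2000,
    "newly_death": 4,
    "total_severe": 300,
}
params = covid_scene_parameters(sample_row)
```

Writing the scaling factors directly in the query is what lets a creator retune the presentation without touching Unity.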


Figure 16: Dataset of 22nd March 2021 about Covid-19


Figure 17: Dataset of 22nd April 2021 about Covid-19 


Figure 18: Dataset of 22nd May 2021 about Covid-19 


Figures 16 and 17 show the result of visualization. Newly infected
cases, severe cases, and death cases increased from March to April
2021. The increase in vaccine inoculations is also shown through
the number of birds. The result for 22nd May (Figure 18) has lush
grass and more birds, which shows the infection condition
intuitively. From this visualization, we can convey the message
that even though the spread of the virus is uncontrolled,
inoculations have increased, and their effect can be expected.


4.6 Another Asset 


In this section, we show an alternative asset for the same Covid-19
dataset. Table 5 shows the corresponding relationship in the
Covid-19 dataset with a new asset, an amusement park[15].


Table 5: Corresponding relationship in Covid-19 dataset 
with a new asset 


Data information | Asset property
newly infected cases | strength of lighting (less people more bright)
total severe cases | size of foliage
death cases | amount of foliage
total vaccine inoculations | size of attractions
newly vaccine inoculations | rotate rate of attractions

-QueryExample3
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Generate Unity_dv 
scene('Park')
,env('Bushes', 'transform', 'scale_child', c.total_severe/4000)
,env('Trees', 'transform', 'scale_child', c.newly_death/50)
,env('Directional Light', 'Light', 'intensity', (8000 - c.newly_infection)/5000)
,env('FerrisWheel_Rotate', 'FerrisWheel_Rotation', 'speed', c.newly_vaccine/1000)
,env('Carousel_Rotate', 'Carousel_Rotation', 'speed', c.newly_vaccine/10000)
,env('FerrisWheel', 'transform', 'scale', c.total_vaccine/7000000 + 0.5)
,env('Carousel', 'transform', 'scale', c.total_vaccine/7000000 + 2.5)
,annotation('textbottom', c.date || ',' || 'newly infection' || ':' || c.newly_infection)
from covid19 c
where c.date = '2021-3-22'


Figure 19: Dataset of 22nd March 2021 about Covid-19 with 
a new asset 


Figure 20: Dataset of 22nd April 2021 about Covid-19 with a
new asset


Figure 21: Dataset of 22nd May 2021 about Covid-19 with a
new asset


Figures 19, 20, and 21 show the infection condition on 22nd March,
April, and May. These visualizations show the range of presentation
that the system offers, because users can choose any asset they want
to use.


5 EVALUATION 


For evaluation, we plan to invite participants to use this system
and compare it with other data storytelling systems in terms of
expressiveness, learning cost, and amount of code.


6 CONCLUSIONS 


At the present stage, we have implemented the StoryGenerator, which
generates a static data video through the timeline grouper. In the
future, we will use the cinematic asset to design and implement the
StoryEditor to achieve a high-quality and more ac data video.


ABSTRACT


Nowadays, companies manage a large volume of data usually or- 
ganised in "silos". Each "data silo" contains data related to a specific 
Business Unit, or a project. This scattering of data does not facili- 
tate decision-making requiring the use and cross-checking of data 
coming from different silos. So, a challenge remains: the construc- 
tion of a Business View of all data in a company. In this paper, we 
introduce the concepts of Enterprise Knowledge Graph (EKG) and 
Decentralised EKG (DEKG). Our DEKG aims at generating a Busi- 
ness View corresponding to a synthetic view of data sources. We 
first define and model a DEKG with an original process to generate 
a Business View before presenting the possible implementation of 
a DEKG. 


CCS CONCEPTS 


- Information systems → Enterprise applications; Information
integration;


1 INTRODUCTION


Today’s organisations have large volumes of production data and
documents scattered across multiple sources and heterogeneous
environments. One of the most popular ways to store this production
data is to organise it as "data silos". This type of storage allows
companies to organise data according to different criteria specific
to their activities (by project, by supplier, by customer, by Business
Unit, etc.). Each silo allows local data management with an adapted
storage system. However, the isolation of the data contained in
silos is a major drawback: it can lead to redundancy or even
strong inconsistencies between different silos. Also, because of the
isolation of those silos, the company management staff does not have
a Business View of the data. In addition, many documents generated
in companies are not intensively used for decision-making.

Different strategies currently exist to construct Unified Views,
such as Data Warehouses or Data Lakes. Data Warehouses
(DW) are Business Intelligence (BI) databases used to centralise
useful data for the decision-makers of an organisation [21]. Data
Warehouses contain only a part of the production data, which is
predetermined and modelled to match specific needs. The integration
of this data is named Extract, Transform, Load (ETL). On the other
hand, Data Lakes (DL) ingest raw data from multiple sources
(structured or unstructured data) and store them in their native format
[26]. An advantage of a DL is that the data is prepared only
when it is used. As opposed to the DW, the Data Lake
ingests as much of the organisation's data as possible. It also does
not require a modelled schema, and can thus be used by a greater
number of users who are not decision-makers.

The Enterprise Knowledge Graph (EKG) is one of the newest
approaches that can also be used as a solution to the “data silos
problem” in a company [18]. It is defined as a structure used to
"represent relationships between the datasets" [8] but also as a
"semantic network of concepts, properties, individuals and links
representing and referencing foundational and domain knowledge
relevant for an enterprise" [11]. In particular, an EKG highlights the
relationships that exist, for instance, between the pieces of
information currently available in the company.

Unfortunately, organisations still have difficulties creating
their own EKG and integrating into a single schema the large amount
of heterogeneous data, information and documents available in the
numerous sources they own. Moreover, the obtained model does
not really abstract the data into concepts, which makes its
exploitation by decision-makers difficult.

In this paper, we propose an original process to generate a Busi- 
ness View of multiple data sources within an Enterprise Knowl- 
edge Graph. The Business View is a synthetic view allowing a 
non-technical end-user to explore and discover the data of an or- 
ganisation. Thus, we first detail the different concepts related to our 
proposal and then describe the generation of such a view through 
different steps within a Decentralised Enterprise Knowledge Graph. 


2 RELATED WORK 


Initially, the concept of Knowledge Graph (KG) was defined in the
Semantic Web domain. The goal of a KG in the Semantic Web is to build
an "Entity-centric view" of the data from multiple sources [13]. A
KG links multiple resources from different websites together. A
resource can be anything, from a webpage to an open data API
endpoint that represents an entity. Back in 2012, Google was able to
build a Knowledge Graph based on different sources, like Freebase
or Wikidata [29].

Outside the Semantic Web domain, a Knowledge Graph may be
defined as a graph of entities and relationships [24] or "a network
of all kind of things which are relevant to a specific domain or to
an organization [...]" [7]. The definition of Ehrlinger et al. is
equivalent to Blumauer’s, except that it adds the possible use of
inference in a Knowledge Graph: "A knowledge graph acquires and
integrates information into an ontology and applies a reasoner to
derive new knowledge" [10]. According to these definitions,
a Knowledge Graph may have technical characteristics, but it is
specific to a particular domain of research.

An Enterprise Knowledge Graph, a KG applied to an organisation,
is a private Knowledge Graph containing private and domain-specific
knowledge. The EKG can be used as a Business View that is
able to solve the "data silos problem" [18]. The goal of such an EKG is
to help users represent, manage and exploit the knowledge of an
organisation. It also allows machine-to-machine interoperability
using the stored knowledge. The Data Source of the EKG is usually
centralized [32].

Some papers explain the implementation of Enterprise Knowledge
Graphs [11, 30, 33]. Additionally, some tools make it possible to
integrate database contents [1, 31] or documents [9, 14] into a
Knowledge Graph, or even to build it directly from texts through Named
Entity Resolution [12] and Thematic Scope Resolution [4, 12]. Less
research has been done on the querying side of Enterprise Knowledge
Graphs [30]. Based on the literature, the Enterprise Knowledge
Graph (EKG) seems to be a good support for a Business View, as
it offers a lot of flexibility. Moreover, relationships may help to
understand how the data is related for decision-making processes,
and also help to infer new knowledge from the existing one.
However, as far as we know, the literature provides neither a clear
definition of what an Enterprise Knowledge Graph is nor a global
architecture. We also observe that no process exists to build an EKG
schema from the available data.

With the aim of proposing a new way of building Enterprise Knowledge
Graphs, we also studied Schema Matching. Schema
Matching is a set of methods used to "determine which concepts
of one schema match those of another" [23] in distributed data
sources. Schema Matching has been well studied in the literature,
covering both Database Matching [16, 25], which aims at defining
methodologies to match different relational databases, and Ontology
Matching [3], which aims more at matching Semantic Web ontologies.
New approaches based on similarity have been developed to allow
schema matching on graph data structures [22] and to improve
current Schema Matching techniques [20, 27].

In the next section, we detail our own definition of the Enterprise 
Knowledge Graph as a Business View of company data. We also 
define the concept of Decentralised EKG (DEKG). We propose a 
process to build a DEKG Business View schema from existing data 
using schema matching, before discussing its architecture and its 
implementation. 


3 ENTERPRISE KNOWLEDGE GRAPH 
3.1 From EKG to DEKG 


In our context, an Enterprise Knowledge Graph (EKG) represents
all the source data of interest for the company in order to offer end
users a Business View of this data. Moreover, an EKG should emphasise
the relationships that exist between them. Such a Business
View allows end-users to identify, locate, access and finally analyse
these data. An EKG also helps them in decision-making tasks
thanks to the knowledge they can extract from the Business View.
Such a Business View mostly corresponds to the result of the
integration of the different sources' schemata. An EKG should be easily
extensible in terms of data (i.e. it should ingest new data sources)
and user needs (i.e. queries, exploration capabilities, etc.), without
the need for human intervention.

In this paper, we define an extension of this concept, named a
Decentralised Enterprise Knowledge Graph (DEKG). The
decentralised dimension of a DEKG aims at offering better scalability
of the underlying system, since data are not integrated in a
centralised system (i.e. only data source schemata are centralised).
Our approach is based on the global-as-view approach [19].

Furthermore, in contrast with the above definition, a DEKG
offers end-users a Business View that is a synthetic view
of source data through a unified schema. It is a schema that
corresponds to an "end-user view" of the global schema content. It
"erases" the specific implementation particularities present in the
different data sources. Such a view allows the end-user to understand
what information is available, how it is linked and in which
database/repository it is stored.

To obtain such a Business View, a DEKG implementation is based
on a 3-step process (see Figure 1):

- the first step aims at generating data source schemata from
the different data sources, each independently of the others;

- the second step aims at generating a global schema that is
the result of the schema matching of the different data source
schemata;

- the third step is more original since it aims at synthesising the
global schema into a synthetic and understandable schema
that will be proposed as the Business View to the end-user.


Figure 1: Decentralised Enterprise Knowledge Graph workflow


Please note that the DEKG always keeps all the data sources while
generating all the different schemata in each step.

Since a DEKG should emphasise the relationships between data and
data schema, we propose in this paper to model a DEKG as a set of
graphs. The relationships present in the different DEKG schemata
will help decision-making by showing the "context" of the available
data. These relationships also support graph exploration and new
knowledge discovery.


3.2 Our DEKG construction process 


Our process generates different schemata. Following Figure
1, we define the concepts of source schema, global schema and
Business View and explain how to generate them. Moreover, in
order to explain every schema generation, we propose an illustrative
example.

Example. Three data sources are available (one relational database
containing 3 tables, and 2 CSV files). The content of each data source
is presented in Table 1.


3.2.1 General schema graph model. All the schemata generated
by the DEKG are modelled as heterogeneous graphs, based on
Property Graphs [5, 17]. We define a graph $g$ as follows: $g = (V, E)$;
$V = \{\vartheta_1, \ldots, \vartheta_n\}$ is the set of nodes and
$E = \{e_1, \ldots, e_m\}$ the set of edges, with $E \subseteq V \times V$.
Each node has a label $\tau$ that belongs to
$T = \{\tau_1, \ldots, \tau_t\}$. The function $\omega : \vartheta \rightarrow \tau$
is used to return the label $\tau$ of a node. Every edge also has a label
$\mu$ belonging to $M = \{\mu_1, \ldots, \mu_u\}$ and the function
$\eta : e \rightarrow \mu$ returns the label $\mu$ of an edge. Node and
edge labels can be characterised by a set of attributes belonging to
$X = \{\chi_1, \ldots, \chi_x\}$. Those attributes, in the context of an
Enterprise Knowledge Graph, are also called properties. To simplify, we
ignore in this definition all companion functions (modifying an
attribute, getting the list of attributes of an edge or a node, etc.).

Based on these definitions, we define each DEKG schema in
the next sections. Due to space limitations, we limit our discussion
to structured data sources only.
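
As an illustration of the graph model of Section 3.2.1, the following minimal Python sketch stores labelled nodes and edges together with their attributes. It is not the authors' implementation: the class names Node, Edge and Graph and the method names omega and eta (mirroring the two label functions) are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    label: str                                   # node label, an element of T
    attrs: dict = field(default_factory=dict)    # attributes taken from X


@dataclass
class Edge:
    source: str
    target: str
    label: str                                   # edge label, an element of M
    attrs: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)    # V, indexed by node id
    edges: list = field(default_factory=list)    # E, a subset of V x V

    def add_node(self, node_id, label, **attrs):
        self.nodes[node_id] = Node(node_id, label, dict(attrs))
        return self.nodes[node_id]

    def add_edge(self, source, target, label, **attrs):
        # An edge may only connect nodes that already belong to V.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be nodes of the graph")
        edge = Edge(source, target, label, dict(attrs))
        self.edges.append(edge)
        return edge

    def omega(self, node_id):
        """Return the label of a node."""
        return self.nodes[node_id].label

    def eta(self, edge):
        """Return the label of an edge."""
        return edge.label


g = Graph()
g.add_node("n1", "ENTITY", NAME="Student")
g.add_node("n2", "PROP", NAME="First Name")
e = g.add_edge("n1", "n2", "hasProp")
print(g.omega("n1"), g.eta(e))
```

Keeping attributes in a dictionary leaves room for the companion functions that the definition deliberately ignores.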


3.2.2 Source Schema - Step #1. The source schema is composed
of all the schemata of all sources handled by the DEKG. This schema
results from the extraction of the structure (e.g. list of attributes)
of every source independently from the others. In order to facilitate
the schema matching done at step #2, we decided not to store attributes
in nodes and chose an "Exploded" Graph. Thus, in the source
schema, any node $\vartheta$ in $V$ corresponds to either an entity (e.g.
the name of a table in a relational database), a relationship between
two entities (e.g. a foreign key in a relational database), or
an attribute (property) characterising an entity. The set of node
types of V is defined as T = {REL, ENTITY, PROP} where "REL"
corresponds to a relationship, "ENTITY" to an entity and "PROP" to
a property.


Table 1: Data Source structures and content

DATABASE

Students Table
id | Last Name | First Name | Class (FK)
   |           | John       | A
   |           | Jane       |

Class Table
id | Teacher (FK)
A  | Alice Machin
B  | Bob Lambda
   | John Doe

Teachers Table
id | Full name
1  | Alice Machin
2  | Bob Lambda
3  | John Doe

People.csv
LastName | FirstName
Machin   | Alice
Lambda   | Bob
Doe      | John
Doe      | Jane

Grades.csv
FULL NAME | GRADE | COURSE
John Doe  | 15    | M3xxx
Jane Doe  | 16    | M1xxx

In order to properly connect the different nodes, we define the
different types of any edge e in E as: M = {hasProp, hasRel} where
"hasProp" connects a node of type ENTITY or REL to a node of type
PROP and "hasRel" connects a node of type ENTITY to another
node of type ENTITY.

Moreover, to keep the location of data within the data source, every
node or edge is characterised by a minimal set of attributes
X = {NAME, ID, URIS} where NAME corresponds to the name
of the entity/property/relationship, ID corresponds to the identifier
of the node/edge and URIS corresponds to a set of URIs. Every URI
corresponds to the data source URI where the data is located
(entity/relationship).
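
The extraction of an "Exploded" source schema from one table can be sketched as follows. This is a hedged illustration rather than the authors' extractor: the function explode_table, the FROM/TO keys and the URI "db://school/Teachers" are hypothetical, and foreign keys (which would add REL nodes) are left out for brevity.

```python
def explode_table(table, columns, uri):
    """Return (nodes, edges) of the exploded source schema of one table.

    The table becomes an ENTITY node, each column becomes a PROP node,
    and a hasProp edge links the entity to each of its properties.
    Every element carries the minimal attribute set NAME, ID and URIS.
    """
    nodes = [{"ID": table, "TYPE": "ENTITY", "NAME": table, "URIS": {uri}}]
    edges = []
    for column in columns:
        prop_id = f"{table}.{column}"
        nodes.append({"ID": prop_id, "TYPE": "PROP", "NAME": column, "URIS": {uri}})
        edges.append({
            "ID": f"{table}->{prop_id}",
            "TYPE": "hasProp",
            "NAME": "hasProp",
            "URIS": {uri},
            "FROM": table,   # hypothetical key naming the ENTITY endpoint
            "TO": prop_id,   # hypothetical key naming the PROP endpoint
        })
    return nodes, edges


# Hypothetical URI for the Teachers table of the running example.
nodes, edges = explode_table("Teachers", ["id", "Full name"], "db://school/Teachers")
for node in nodes:
    print(node["TYPE"], node["NAME"])
```

Storing properties as separate nodes is what later allows matching rules to link two properties directly.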


In the following sections and figures, we represent Exploded
Graphs, such as the one in Figure 2, as Property Graphs where our
Exploded Graph types T (< entity >, < rel > and < prop >) are displayed
as labels, and the node attributes are stored as properties (for the
sake of clarity, no property other than the name is depicted in the
figures).


[Figure 2 panels: Database schema (3 table schemata connected through foreign keys); “People.csv” schema; “Grades.csv” schema]


Figure 2: Data source schemata extraction result


As a result, the Source schema contains all the data source
schemata, which are not yet connected. Figure 2 shows the Source schema
extracted from our example data sources (Table 1). The objective of
the next step is to construct the global schema.


3.2.3 Global Schema - Step #2. The global schema corresponds
to the result of the schema matching of all the data source schemata
available in the Source schema.

Inspired by previous work in schema matching, we define in this paper
five additional edge types, meaning that the set of edge types
M in the global schema is defined as:


M = {hasProp, hasRel, identical, similar, extends, includes, aggregation}


We define each of these types as follows:


- identical: can be applied to an edge between two entities; it
highlights that they are textually identical and should
be treated as exactly the same entity;

- similar: can be applied to an edge between two entities, two
relationships or two properties; it highlights that they are
related and could possibly be treated as the same entity,
relationship or property;

- extends: can be applied to an edge between two entities, or
two relations; it shows that one entity/relation is a "superclass"
of another entity/relation. The superclass ontologically
represents a broader entity/relationship;

- includes: can be applied to an edge between two properties; it
highlights that values of a property class are included
in another property class at a certain rate p;

- aggregation: can be applied to an edge between three or
more properties; it highlights that a property is a combination
of two or more other properties. Somehow, those
properties could be treated as similar in a low-granularity
view of the schema. Introducing aggregation inside a graph
schema makes it an n-uniform hypergraph, as we are linking
more than two nodes together.


To infer new edges of such types in the global schema, we also 
define eight rules that exploit graph structure and data from data 
sources to create new relationships. These rules and global schema 
mapping algorithm are detailed in section 3.3. 


3.2.4 Business View schema - Step #3. The Business View schema
is a schema that abstracts the schema matching process that was
run in the previous steps. At this step, the schema returns to a
Property Graph as defined in Section 3.2.1.

The way this Business View Schema is generated is detailed in
Section 3.4.


3.3 Constructing the DEKG Global Schema 


In our process, the main step is the definition of the global schema,
since it is the result of the schema matching of all the source
schemata.

In order to connect nodes of the different graphs of source data
with new edges, we define eight rules, each of which aims at creating
a specific type of edge from:


M = {hasProp, hasRel, identical, similar, extends, includes, aggregation}.


Such rules make it possible to consider multiple entities, properties,
or relationships and related data. This section details these rules and
the way they are applied through a specific algorithm.


3.3.1 Matching schemata rules. To facilitate decision-making
on a global view, it is necessary to define new relationships between
components of the graphs representing source schemata. In our
approach, we propose a schema matching based on rules that
automatically specify new relationships and/or nodes. We based them on
both Database Matching [25] and Ontology Matching [3] ideas to
propose both a relational and a semantic approach. Those rules
cover multiple approaches of schema matching defined by Rahm et
al.: both schema-based and instance-based, and using linguistic
matching and constraints.

These rules are designed to be generic and to be run automatically,
but they can also be complemented by multiple sets of domain-specific
rules depending on the information the company wants to include
in the Business View.


- R1: We create an "identical" type edge between two entity
nodes if they have a strictly equal value for attribute NAME;

- R2: We create a "similar" type edge between two entity
nodes if their NAME values are semantically close [15];

- R3: We create a "similar" type edge between two
relationships A and B if A is linked by "hasRel" to an entity
C and B is linked by "hasRel" to another distinct entity D, where C
and D are linked by a "similar" edge. A and B must also be
linked to a third common entity E;

- R4: We create an "aggregation" type edge between a set of
properties from one entity A and one property of another
entity B if concatenated values of the properties of A equal
the values of the property of B;

- R5: We create an "includes" type edge from a property A to
a second one B if the values of the property A partly equal those of
property B. We store the rate of equality in the "includes"
edge;

- R6: We create a "hasProp" type edge between an entity
E linked to a property A by "hasProp", and a property B of
another entity if the property A extends B;

- R7: We create an "extends" type edge from one entity (node
of type ENTITY) A to another B if all the properties (nodes
of type PROP) of A extend or aggregate property classes;

- R8: We create a new REL node A, a "hasRel" edge between
A and an entity B, and another "hasRel" edge between A and a
different entity C, if all properties of B are linked by an
"extends" or "aggregates" edge to the properties of C and B
and C are not linked by any direct edge (extends, similar,
identical, etc.).


Figure 3: Full matched Global Schema example
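
The two linguistic rules R1 and R2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the paper does not specify the similarity measure here, so the standard-library difflib.SequenceMatcher (a purely string-based ratio, not a semantic measure) is used as a stand-in, and the function name linguistic_edges, the threshold 0.8 and the entity names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def linguistic_edges(entities, sim_threshold=0.8):
    """Apply R1 and R2 to a list of (node_id, name) ENTITY nodes.

    Returns the new edges as (id1, id2, edge_type, score) tuples.
    """
    edges = []
    for (id1, name1), (id2, name2) in combinations(entities, 2):
        # R1: strictly equal NAME values give an "identical" edge.
        if name1 == name2:
            edges.append((id1, id2, "identical", 1.0))
        # R2: close NAME values give a "similar" edge.
        # Stand-in measure: character-based ratio between 0 and 1.
        score = SequenceMatcher(None, name1.lower(), name2.lower()).ratio()
        if score > sim_threshold:
            edges.append((id1, id2, "similar", score))
    return edges


# Hypothetical entity names, chosen only to trigger both rules.
entities = [("a", "Student"), ("b", "Student"), ("c", "Students"), ("d", "Grades")]
for edge in linguistic_edges(entities):
    print(edge)
```

A thesaurus-based or embedding-based measure could replace the stand-in without changing the structure of the loop.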


Algorithm 1: Global Schema Construction: linguistic schema rules

Input: S = {s1, s2, ..., sn}  # Set of all the data source schemata (= Source Schema)
GS = ∅
for each data source schema s of S do
    copy the data source schema s into the global schema (GS);
for each couple of "ENTITY" nodes (e1, e2) in GS do
    # R1
    if e1.name = e2.name then
        Create "identical" between e1 and e2 ('name') in GS
    # R2
    if wordSimilarity(e1.name, e2.name) > simThreshold then
        Create "similar" if not exists between e1 and e2
            ('name', 'wordsim', wordSimilarity(e1.name, e2.name)) in GS
    if thesaurusSynonym(e1.name, e2.name) > synThreshold then
        Create "similar" between e1 and e2
            ('name', 'thesaurus', thesaurusSynonym(e1.name, e2.name)) in GS
return GS;


Some of those rules can even create nodes in our Global Schema,
meaning that it is possible to create a new Entity, Relationship or
Property (i.e. a node with one of these types). R8, for instance,
creates a new relationship (node of type REL) between two entities
(two nodes of type ENTITY).

The rules are ordered to be executed from R1 to R8 and must
be re-executed each time one of the original data source schemata
changes. The rules have to be executed once the source schema is
constructed to start schema matching. An algorithm to construct
the global schema is proposed in Algorithms 1, 2 and 3.


Algorithm 2: Global Schema Construction: instance based rules

Input: S = {s1, s2, ..., sn}  # Set of all the data source schemata (= Source Schema)
GS = CurrentGlobalSchema  # Result of the previous Algorithm
for each couple of "REL" nodes (r1, r2) in GS do
    # R3
    # if r1 and r2 have a common entity through REL nodes
    if hasRels(r1).some(hasRels(r2)) then
        if links(nodes(r1)).filter(links(nodes(r2))).containsType(['similar', 'identical']) then
            Create "similar" between r1 and r2 in GS
for each set of 2 properties or more (p1, p2, ..., pn) of the same entity and a property p0 of another entity in GS do
    # R4
    sameNum := 0
    for each line of values(p1) as v1, values(p2) as v2, ..., values(pn) as vn do
        if value(p0) = v1.concat(v2, v3, ..., vn) then
            sameNum++
    if sameNum > 0 then
        Create "aggregation" between p1, p2, ..., pn and p0 with argument
            (sameNum/length(values(p1, p2, ...)), sameNum/length(values(p0))) in GS
for each couple of property nodes (p1, p2) in GS do
    # R5
    includePct = values(p1).some(values(p2))
    if includePct > 0 then
        Create "includes" from p1 to p2, with
            (includePct/length(values(p1)), includePct/length(values(p2))) in GS
    # R6
    if links(p1).filter(links(p2)).filter(type = 'extends').size > 0 then
        A := nodes(links(p1).filter(type = 'hasProp')).filter(type = 'entity')[0]
        B := nodes(links(p2).filter(type = 'hasProp')).filter(type = 'entity')[0]
        if A != B then
            Create new "hasProp" between A and p2 in GS
return GS;


Algorithm 3: Global Schema Construction: instance and schema based rules

Input: S = {s1, s2, ..., sn}  # Set of all the data source schemata (= Source Schema)
GS = CurrentGlobalSchema  # Result of the previous Algorithm
for each couple of "ENTITY" nodes (e1, e2) in GS do
    # R7
    if links(nodes(links(e1).filter(type = 'hasProp'))).filter(type in ['includes', 'aggregate'])
            .includes(links(nodes(links(e2).filter(type = 'hasProp'))).filter(type in ['includes', 'aggregate']))
       AND links(nodes(links(e2).filter(type = 'hasProp'))).filter(type in ['includes', 'aggregate'])
            .includes(links(nodes(links(e1).filter(type = 'hasProp'))).filter(type in ['includes', 'aggregate']))
       AND "all 2nd arg is 1" then
        Create "extends" between e1 and e2 in GS
    # R8
    if links(nodes(links(e1).filter(type = 'hasProp'))).filter(type in ['extends', 'aggregate'])
            .includes(links(nodes(links(e2).filter(type = 'hasProp'))).filter(type in ['extends', 'aggregate']))
       AND nodes(links(e1)).some(e2) == 0 then
        Create new REL node (A) in GS
        Create new hasRel between e1 and A in GS
        Create new hasRel between e2 and A in GS
return GS;
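
The instance-based rules R4 and R5 compare the values stored in the sources. The sketch below illustrates one possible reading of them in Python; it is not the authors' code. The function names, the single-space separator, the first-name-then-last-name order and the set-membership comparison are assumptions, and the values are those of the running example as far as they can be read in Table 1.

```python
def aggregation_rates(part_columns, target_values, sep=" "):
    """R4: compare row-wise concatenations of several properties with
    the values of one property of another entity.

    Returns (matches / number of rows, matches / number of target values),
    or None when no concatenated row matches.
    """
    rows = list(zip(*part_columns))
    targets = set(target_values)
    same = sum(1 for row in rows if sep.join(row) in targets)
    if same == 0:
        return None
    return same / len(rows), same / len(target_values)


def include_rates(values_a, values_b):
    """R5: rate at which the values of property A also appear in property B."""
    lookup = set(values_b)
    common = sum(1 for value in values_a if value in lookup)
    if common == 0:
        return None
    return common / len(values_a), common / len(values_b)


# People.csv columns, Teachers "Full name" and Grades.csv "FULL NAME".
first_names = ["Alice", "Bob", "John", "Jane"]
last_names = ["Machin", "Lambda", "Doe", "Doe"]
teacher_names = ["Alice Machin", "Bob Lambda", "John Doe"]
grade_names = ["John Doe", "Jane Doe"]

print(aggregation_rates([first_names, last_names], teacher_names))
print(include_rates(grade_names, teacher_names))
```

Both functions return a pair of rates, matching the two arguments stored on the "aggregation" and "includes" edges in Algorithm 2.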

Example. After applying the proposed algorithms to the Source
schema (see Figure 2), we obtain the Global Schema shown in
Figure 3. With only four sources and a few properties per source,
we can observe that many new edges (i.e. dashed edges) have
been created between nodes from different data source schemata.
For instance, "red coloured" edges are new hasProp type edges and
"purple coloured" edges are new hasRel type edges, whereas
"blue coloured" edges are similar or extends type edges. All those
new links will be exploited when constructing the Business View
Schema (see Section 3.4). The source URIs of every source node,
whatever type it belongs to (entity, prop or rel), are stored as an
attribute in the matched nodes of the Global Schema. Please note
that in Figure 3, we also store, as an attribute of every element
created during this step, the name of the rule that generated it.
That allows the system to differentiate all graph nodes/edges even
if they have the same type.


3.4 Constructing the Business View Schema 


As explained in Section 3.2.4, the Business View Schema is one of
the most important aspects of our proposal. That view must be built
from the Global Schema, and allow users to see all available data at
a glance. The construction of the Business View schema relies on 2
phases.

3.4.1 First Phase. The first phase is quite usual, since it aims to
gather all similar nodes/edges available in the global schema into a
single node/edge. The result is called "Synthetic Global Schema".

It is important to note that the NAME attribute of every gathered
node is stored in a new attribute (a set of NAME values) called CONTAINS
in the final node. Such new attributes allow us to keep the path to
the aggregated nodes within the global schema. Also, all source
URIs from the source nodes are still stored in a new attribute
containing the set of URIs corresponding to the sources of those
source nodes.

To do so, we exploit the new edges created during schema matching
in the Global Schema, meaning that we gather all nodes connected
by edges of type "identical", "similar", "extends", "aggregation" and
"includes", as specified in Algorithm 4.


Algorithm 4: Business View: node combining

Input: GS
for each link R in GS do
    if type(R.subject) = prop and type(R.object) = prop then
        if R.predicate = includes(x, y) and x = 1 then
            Combine R.object into R.subject
        if R.predicate = aggregate(x, y) and x = 1 then
            Combine R.object into R.subject
    if type(R.subject) = entity and type(R.object) = entity then
        if R.predicate = extends then
            Combine R.object into R.subject
            Add R.object into "contains" attribute in R.subject
        if R.predicate = similar OR R.predicate = identical then
            Create a node E in GS
            Set E.name = R.subject.name + R.object.name
            Combine R.subject into E
            Combine R.object into E
    if type(R.subject) = rel and type(R.object) = rel then
        if R.predicate = extends then
            Combine R.object into R.subject
            Add R.object into "contains" attribute in R.subject
        if R.predicate = identical then
            Create a node E in GS
            Set E.name = R.subject.name + R.object.name
            Combine R.subject into E
            Combine R.object into E
        if R.predicate = similar then
            if links(R.subject).filter(type = "hasRel").subject = links(R.object).filter(type = "hasRel").subject
               AND links(R.subject).filter(type = "hasRel").object = links(R.object).filter(type = "hasRel").object then
                Combine R.object into R.subject
                Set R.subject.name = links(R.subject).filter(type = "hasRel").subject.name
                    + '-' + links(R.subject).filter(type = "hasRel").object.name


3.4.2 Example. When applying this process to the Global Schema
(Figure 3), we obtain the Synthetic Global Schema shown
in Figure 4. We can see that the nodes named "Teacher" and
"Student" have been gathered in the node named "People", since
extends type edges have been created between these nodes in the
Global Schema. We can also see that a new attribute named "contains"
has been added to the node People with the values "Teacher" and
"Student" to keep a link to the corresponding nodes in the Global
Schema.
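
The "extends" case of the node combining step can be illustrated with a short Python sketch. It only covers entity nodes linked by "extends" edges and is not the authors' implementation: the function combine_extends, the dictionary layout and the URIs are hypothetical, and links are simplified to (subject, predicate, object) triples.

```python
def combine_extends(nodes, links):
    """Merge the object of every "extends" link into its subject.

    nodes maps a node name to {"uris": set, "contains": set (optional)};
    links is a list of (subject, predicate, object) triples.
    Returns the merged nodes and the remaining, re-pointed links.
    """
    merged = {
        name: {"uris": set(info["uris"]), "contains": set(info.get("contains", ()))}
        for name, info in nodes.items()
    }
    alias = {}  # absorbed node name -> name of the node that absorbed it

    def resolve(name):
        while name in alias:
            name = alias[name]
        return name

    for subject, predicate, obj in links:
        if predicate != "extends":
            continue
        subject, obj = resolve(subject), resolve(obj)
        if subject == obj or subject not in merged or obj not in merged:
            continue
        absorbed = merged.pop(obj)
        # Keep the source URIs and remember the absorbed node name.
        merged[subject]["uris"] |= absorbed["uris"]
        merged[subject]["contains"] |= absorbed["contains"] | {obj}
        alias[obj] = subject

    remaining = []
    for subject, predicate, obj in links:
        if predicate == "extends":
            continue
        link = (resolve(subject), predicate, resolve(obj))
        if link not in remaining:
            remaining.append(link)
    return merged, remaining


# Hypothetical URIs; node names follow the example of Figure 4.
nodes = {
    "People": {"uris": {"People.csv"}},
    "Teacher": {"uris": {"db/Teachers"}},
    "Student": {"uris": {"db/Students"}},
    "Class": {"uris": {"db/Class"}},
}
links = [
    ("People", "extends", "Teacher"),
    ("People", "extends", "Student"),
    ("Student", "hasRel", "Class"),
]
merged, remaining = combine_extends(nodes, links)
print(sorted(merged["People"]["contains"]), remaining)
```

The "contains" set plays the role of the CONTAINS attribute: it keeps the path back to the nodes of the Global Schema that were gathered.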


Figure 4: Synthetic global schema


Figure 5: Generated Business View


3.4.3 Second Phase. While the first phase goal was to combine 

similar/identical nodes of the global schema, the second phase aims 
at transforming the "Synthetic Global Schema" into the final Busi- 
ness View in which PROP nodes are re-integrated in corresponding 
nodes and relationships as attributes. The Business View is a non- 
technical view, computed on the fly, that represents all the sources 
data in a single endpoint. It is aimed at end-users to help them 
understand and query the available organisations data. 
The types of the nodes and relationships in the Business View corre- 
spond to the NAME attribute value in the Synthetic Global Schema. 
That phase makes the Business View look like an understandable 
and regular entity-relationship meta-graph as graph databases like 
Neo4J present their schema. The algorithm 5 describes this nodes 
re-integration. 
Algorithm 5: Business View: schema reducing

for each "prop" node P do
    if links(P).length = 1 then
        N = links(P).nodes[0]
        for each property Pr of P do
            N.properties[Pr.name] = Pr
        Delete links(P)
        Delete P
for each "rel" node Rn do
    Create a link L
    L.type = Rn.name
    for each property P of Rn do
        L.P = Rn.P
    Delete links(Rn)
    Delete Rn
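
As an illustration, the schema reducing step of Algorithm 5 could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the Node and Schema classes, the tuple layout of links, and the rule that a "rel" node is replaced by a link between the two nodes it was attached to are hypothetical choices made for this example, since the algorithm does not state which nodes the new link L connects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of the Synthetic Global Schema; kind is 'entity', 'prop' or 'rel'."""
    name: str
    kind: str
    properties: dict = field(default_factory=dict)


@dataclass
class Schema:
    """Hypothetical in-memory schema: nodes by name, links as tuples."""
    nodes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (source, target, type, properties)

    def neighbours(self, name):
        """Names of the nodes linked to `name`, in either direction."""
        return [t if s == name else s for s, t, _, _ in self.links if name in (s, t)]

    def drop_links(self, name):
        """Remove every link that touches `name`."""
        self.links = [link for link in self.links if name not in (link[0], link[1])]


def reduce_schema(schema):
    """Fold 'prop' nodes into their owner and turn 'rel' nodes into typed links."""
    for p in [n for n in schema.nodes.values() if n.kind == "prop"]:
        owners = schema.neighbours(p.name)
        if len(owners) == 1:  # only prop nodes with a single link are folded
            for key, value in p.properties.items():
                schema.nodes[owners[0]].properties[key] = value
            schema.drop_links(p.name)
            del schema.nodes[p.name]
    for r in [n for n in schema.nodes.values() if n.kind == "rel"]:
        ends = schema.neighbours(r.name)
        schema.drop_links(r.name)
        del schema.nodes[r.name]
        if len(ends) == 2:  # assumption: a rel node joins exactly two nodes
            schema.links.append((ends[0], ends[1], r.name, dict(r.properties)))
    return schema
```

In this sketch a "prop" node attached to a "rel" node is folded first, so its properties end up on the link that replaces the "rel" node.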


3.4.4 Example. After converting the Synthetic Global Schema
(Figure 4) into a Business View Schema as explained in the previous
section, we obtain the Business View Schema (Figure 5). As we can
see, this figure is quite "simple" and is synthetic enough to be
displayed to end-users. Such a visualisation will support exploration,
navigation or querying operators.


4 IMPLEMENTATION OF A DEKG 


Although distributed solutions for Knowledge Graphs already exist
(like Akutan by eBay [2]), they have not been clearly defined in academic
papers as far as we know, especially for the Enterprise Knowledge
Graph. The currently well-known decentralised one is a federated
approach, in which EKGs allow small subsets of information or knowledge
to be linked together, despite being in separate Knowledge Bases.

In order to improve the maintainability, scalability and high
availability of the DEKG, we propose a specific architecture (see Figure
6) based on DEKG components that can be implemented as
services. This proposal is inspired by Federated Databases [28]
and their adaptation to Data Warehouses [6].

Figure 6 follows the workflow of Figure 1, from the sources
(top of the figure) to the users (bottom). The different components
were designed to follow the different steps of the DEKG construction
process described in Section 3.2: the objective of the Data Component
is to produce the source schema (Step #1; Section 3.2.2), which creates
the Sources Schemata; the DEKG Management System is in charge
of Step #2 (Section 3.2.3), building and storing the Global Schema;
finally, the objective of the EKG App is to build the Business View for
the User described in Step #3 (Section 3.2.4), and to send queries to the
DEKGMS. The following sections introduce the main components
of this architecture.


4.1 Data Components. 


Our proposed architecture contains numerous components
named "Data Component". Their aim is to interpret the data inside
one or multiple data sources that are related by their location (for
instance, in a Business Unit). As we are working in a "Knowledge
Graph" environment, the Data Component has to create a
graph schema representing the data of the source. It is also
responsible for the link between the schema sent to the "DEKG Management
System" and the data contained in the source.

All those Data Components act as "bridges" between the Business
View managed by the DEKG Schema component and the data
sources. Indeed, when the EKG is queried, the Data Components
are also queried in order to obtain the corresponding data. So,
they have to translate the query coming from the Business View
to match the queried source (for example, querying a SQL database, a
document, a CSV...). They are all implemented differently depending on
the information the organisation wants to expose for each source.
Furthermore, the Data Components ensure the availability of the
information. As they are quite small, they can be scaled both
horizontally and vertically to ensure both great scalability and
availability to the users. Finally, those components also ensure the
performance of the overall system. They can, for instance, cache data,
information, or even queries to answer faster to the queries made
by the Decentralised Enterprise Knowledge Graph (DEKG) Management
System.

Figure 6: Decentralised EKG architecture components


4.2 DEKG Management System. 


As shown in Figure 6, a bigger component named "DEKG Management
System" is proposed. Its aim is to manage the User and
Application queries and communicate with every Data Component
to answer the user queries. It builds and manages all DEKG
schemata and finally generates the Business View. The answers,
the schema, and the queries must be transparent to the user, as if they
were querying a centralised system. The Management System is
divided into three essential components: the "Query Decomposer",
the "DEKG Schema", and the "Data Merger".

The most important and most complex component of the DEKG
Management System is the "DEKG Schema" component. It
integrates, stores, and maps the different schemata (not the actual data)
coming from all the Data Components included in the Decentralised
Enterprise Knowledge Graph. It is in that specific component that
the Global Schema of Section 3.3 is constructed and stored.

The Query Decomposer is the component receiving the queries
from the User/App. Its goal is to break down the user query into
sub-queries, which will be sent to the corresponding Data Components
using the DEKG schema. The goal of the Data Merger is to get the
different responses from the Data Components and merge them
back into a single response using the original user query and the
DEKG Schema. That unified response is sent to the user.
To do so, the Query Decomposer must use the previous Global
Schema to expand the user query into a subset of queries for each
source, run them concurrently and collect all the sub-query responses.
Those are received by the Data Merger, which needs to aggregate them
and manage the inconsistencies between all sources [19, 23].
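
The decompose, run and merge flow described above could be sketched as follows. This is only an illustrative sketch: the paper does not describe or evaluate these steps, so the function names, the `schema_map` dictionary (Business View attribute to Data Component name), the representation of a Data Component as a callable returning rows, and the join on a shared key are all assumptions made for this example. The merge step simply lets later responses overwrite earlier ones and does not handle inconsistencies between sources.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(attributes, schema_map):
    """Group the requested Business View attributes by the component serving them."""
    sub_queries = {}
    for attribute in attributes:
        sub_queries.setdefault(schema_map[attribute], []).append(attribute)
    return sub_queries


def run_all(sub_queries, components):
    """Send every sub-query to its Data Component concurrently and collect the rows."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(components[name], attributes)
            for name, attributes in sub_queries.items()
        }
        return {name: future.result() for name, future in futures.items()}


def merge(responses, key):
    """Merge partial rows that share the same key value into single rows."""
    merged = {}
    for rows in responses.values():
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return sorted(merged.values(), key=lambda row: row[key])
```

A caller would chain the three steps, for example `merge(run_all(decompose(attributes, schema_map), components), key)`, so that the user only sees one unified response.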


4.3 Querying the DEKG Synthetic View 


Finally, the component named "EKG App" in Figure 6 represents
the Human Machine Interface between the end-user and
the Decentralised Enterprise Knowledge Graph. This component
allows the user to visualise the Business View as specified in
Section 3.4 and shown in Figure 5. The goal of the Application is also to
query the whole DEKG from a unified endpoint.


5 EXPERIMENTAL RESULTS OF THE DEKG 


To evaluate our DEKG, we decided to apply it to real-world open
data in a real use case. Our EKG would allow a user from Toulouse or
its surrounding cities to know which vaccination centre they can
go to, depending either on their city name or on their postal code. In
France, a postal code can cover multiple cities and a city can have
multiple postal codes, hence the importance of being able to search
both by "City" and by "Postal Code".

To do so, we integrated the French Toulouse city Opendata
(Communes Toulouse¹; Code postaux Toulouse²) containing data on all
Cities of Toulouse and ZIP Codes; and a COVID-19 dataset from
France (Centres de Vaccination³) representing all available
vaccination centres in France.


[Figure 7 labels: <entity> Toulouse Communes; <entity> Centres Vaccination; <entity> codepostal Toulouse; <prop> nodes including Libcom, Code_fantoir, Geopoint, Code_postal, libelle, geoshape, code_insee, Id_code_postal and gid; weighted links includes (0.01, 0.19), includes (0.19, 0.01), includes (0.06, 0.09) and includes (0.5, 0.01)]

Figure 7: Open Data Schema Matching


As those open data are all flat files, there were no
relationships in their initial schemata. The only applicable rule in that
specific case was Rule 5 (creating includes between properties).
After Global Schema reducing in Figure 7, we can see that we are
able to get indirect relationships between entities. That example
has shown us that even if not all rules are applicable to the data
sources, our approach is able to highlight and create relationships
between data sources.

To continue our tests, we ran the algorithm on our first School
example, and also on enterprise employee skills data acquired through
an internal survey. The execution times of the different steps,
and the number of nodes and links contained in both the Global
Schema and the Business View, are presented in Table 2. We
can see that when the amount of data grows, Schema Matching
is the step taking the most time. Also, the Business View has
far fewer nodes and links than the Global Schema, making it much
more understandable by the end-user, especially when the data is
clean and therefore successfully matched.

¹ https://data.toulouse-metropole.fr/explore/dataset/communes/information/, accessed
2021-05-11
² https://data.toulouse-metropole.fr/explore/dataset/codes-postaux-de-toulouse/information/,
accessed 2021-05-11
³ https://www.data.gouv.fr/fr/datasets/relations-commune-cms/, accessed 2021-05-11

Table 2: Algorithm runs on different datasets
Dataset | Execution time (ms): Source Load | Schema Load | Schema Matching | Schema Reducing | Total | Global Schema: Nodes | Links | Business View: Nodes | Links
School Example 6.141 3.992 35,221 22 6
France OpenData 96.283 424.383 563.481 52 56
Company Team of Teams 1:02.683 (m:ss) 1:02.807 (m:ss)


6 CONCLUSION 


Organisations need a Unified View of their data in order
to strengthen their data management. We presented in this paper
the Decentralised Enterprise Knowledge Graph as one solution
to build a Unified View of the whole organisation's data, through
our Business View. Thus, we offered a more organisation-oriented
definition of the Enterprise Knowledge Graph and a decentralised
architecture that can be implemented in an enterprise. Using the
sources schemata and schema matching, our Decentralised Knowledge
Graph is able to generate a Business View of all the data and
data sources. This approach has been tested against sample data,
but also on real-life data from different public sources of multiple
providers.

This Decentralised Enterprise Knowledge Graph can be used by
organisations to build Enterprise Knowledge Graphs that do not
copy the original sources, while still allowing their users to query
the source data from a single unified point. Our Business View
method allows the stored Global Schema to be shown to non-technical
end-users in an easy and understandable way. This DEKG can
be used in many of the applications where Enterprise Knowledge Graphs
are used today, for instance building Data Catalogues of
organisations.

Despite our architecture being scalable in terms of source
integration, the Global Schema building might grow in computational
complexity as the sources grow. Thus, we plan on working on
more specific processes to help update the Global Schema when the
sources schemata are updated, while keeping the integrity of the
schema. Using graph embeddings to enhance our current rule-set is
also planned. We will also need to work on and integrate existing
methods for query decomposition and response merging,
which we did not describe or evaluate in this paper. Finally,
another future work is to handle data duplication and
inconsistency when the same data can be queried from multiple sources.
We have worked until now on non-duplicated and consistent data, which
showed us the feasibility of our DEKG, but we might face difficulties
and challenges when working with low-quality data.
ABSTRACT


Data lakes are supposed to enable analysts to perform more effi- 
cient and efficacious data analysis by crossing multiple existing 
data sources, processes and analyses. However, it is impossible to 
achieve that when a data lake does not have a metadata governance 
system that progressively capitalizes on all the performed analysis 
experiments. The objective of this paper is to have an easily acces- 
sible, reusable data lake that capitalizes on all user experiences. To 
meet this need, we propose an analysis-oriented metadata model 
for data lakes. This model includes the descriptive information 
of datasets and their attributes, as well as all metadata related to 
the machine learning analyses performed on these datasets. To
illustrate our metadata solution, we implemented an application 
of data lake metadata management. This application allows users 
to find and use existing data, processes and analyses by searching 
relevant metadata stored in a NoSQL data store within the data 
lake. To demonstrate how to easily discover metadata with the 
application, we present two use cases, with real data, including 
datasets similarity detection and machine learning guidance. 
1 INTRODUCTION


Data Lakes (DL) have emerged as a new solution for Big Data 
analytics. The ambition of DL is to offer various capacities such as 
data ingestion, data processing and data analysis. A DL ingests raw 
data from various sources, stores data in their native format and 
processes data upon usage [19]. It can also ensure the availability of 
processed data and provide access to different types of end-users,
such as data scientists, data analysts and BI professionals. 

The major pitfall of data lakes is that they can easily turn into 
data swamps (i.e. a simple storage space containing all data without 
any explicit information on them). On the contrary, data analysts 
or other users need to retrieve and reuse the datasets, associated 
processes or analyses. To ensure an efficient capitalization of DL, 
a centralized metadata management system must be deployed. In 
the literature, several authors propose DL metadata classifications 
[3, 8, 26], and different metadata systems have been implemented 
[2, 5, 9]. However, these solutions mainly focus on data management 
and do not offer metadata on different stages of DL (ingestion, 
processing and analysis). Regarding data analysis, different models 
of machine learning (ML) or data mining analysis are proposed by 
W3C [6] and other authors [12, 18]. These solutions can support 
data analysts in making better choices on the data mining process for
one specific dataset; nevertheless, they mainly focus on the phase of
data analysis and do not take advantage of data and data processing 
information across all datasets in data lakes. 

Surprisingly, although the ultimate goal of DL is to facilitate data
analytics, few solutions have been proposed to improve the effectiveness
and efficiency of the analysts' work through metadata dedicated to
supporting decisional analysis. Therefore, the objective of this paper is
to address this gap. For this purpose, information produced during
the different phases of data analytics has to be taken into account. In
particular, these metadata should include: (i) descriptive
information about datasets (statistics on attribute values, management of
missing values, relationships between datasets and relationships
between attributes) and (ii) analytical information about datasets
(performance of implementation and parameters of the executed
algorithms).

The contribution of this paper is to propose a metadata model 
which contains the information produced in the different phases 
of data analytics in DL and which can facilitate data analysis tasks. 
The paper is organized as follows. Section 2 details previous works 
about metadata in an analysis-oriented purpose. In particular, we 
detail previous works about metadata classifications found in the 
Knowledge Discovery in Databases (KDD) process context [3], the
meta-learning context [21] and the data mining context [6, 12, 18]. 
Section 3 details the metadata model dedicated to data analysis that 
includes descriptive and analytical information. In Section 4, we 
detail three algorithms of metadata feeding and we spotlight two 
use cases of data analysis based on real datasets. 


2 RELATED WORKS 


Metadata management is essential for data lakes, and different authors
provide solutions with different emphases. Dataset metadata,
including information on each single dataset and the relationships
among them, are studied by various authors [2, 4, 13, 24]. Data
processing metadata, aiming to improve the efficacy of data
preparation, are also studied by different authors [16, 28]. However, to
the best of our knowledge, data analysis metadata have been little
studied as a means to better manage data lakes.

We have emphasized the importance of metadata management
for data lakes in [19] and proposed a basic metadata model in [20].
Nevertheless, that metadata model is a simple model which mainly
focuses on datasets; the analysis-oriented and data processing
metadata are not deeply studied. We have completed the data
processing metadata in [16], and we address in this paper the analysis
metadata aspect of our metadata model.

In the literature, the challenge of metadata for data analysis is 
particularly studied in the field of meta-learning. Meta-learning 
aims to learn from past experiences [10], in order to improve the
effectiveness of ML algorithms. In this paradigm, metadata describe 
learning tasks (algorithms) and previously learned models (results). 
Those metadata concern, for instance, hyper-parameters of predic- 
tive models, pipelines of compositions or also meta-features. In the 
literature, two work axes stand out: (a) the first axis is focused on 
techniques learning from metadata, for example, by transfer learn- 
ing; in particular, the recent review of [10] summarizes the different 
techniques that can be investigated, (b) the second axis is focused 
on using metadata in different application contexts. For example, 
we find approaches better supporting the model selection [14] or 
the recommendation of predictive model hyperparameters [15, 22]. 
Other recent scenarios are emerging in the context of XAI (eX- 
plainable Artificial Intelligence) such as the approach of [27] which 
proposes a method to explain how meta-features influence a model 
performance and expose the most informative hyperparameters. 

From a high-level view of this work, a lack of consensus emerges on
which metadata to use, not only in the context of ML, but also in data
analysis more generally. Works that define the classification
of those metadata in a formal setting are scarce. To the best of our
knowledge, the works of [3, 12, 18, 23] and a W3C ML model [6]
investigate a classification of metadata related to data analysis.
The authors of [3] give a general overview of the different types
of metadata necessary to support a KDD process. They globally
identify the roles and types of the KDD metadata. In particular,
they propose a metadata classification taxonomy covering
different types of metadata: general measures, statistical measures and
landmarking measures. However, the classification of metadata is
defined at a high level and needs to be detailed. Moreover, although
the dataset is one of the main elements of data analysis, no specific
metadata is given to better characterize it. On the other hand, the
authors of [23] propose a classification of metadata including simple,
statistical, information-theoretic, model-based and landmarking
information. In addition to their classification, they propose an R
package named MFE¹. This paper has the disadvantage of focusing only
on dataset features, and it limits its contribution to a classification of
metadata without modeling. Paper [3] is centered on the conceptual level
whereas the paper of [23] is situated at the technical level. Given
these limitations, our objective is to cover both the conceptual and
implementation levels.

Regarding ML models, different ontologies of data mining have been
proposed [12, 18]. [12] contains detailed descriptions of data mining
tasks, data, algorithms, hyperparameters and workflows. [18]
defines data mining entities in three layers: specification, imple- 
mentation and application. The W3C ML model [6] is oriented to 
ML process. Nevertheless, the relationships between datasets and 
attributes are not considered. However, one of the important ad- 
vantages of data lakes is to cross multiple sources of data, so that 
the relationships between these data are crucial. Hence, adding 
relationship metadata is inevitable in order to facilitate the work 
of analysts by selecting and querying relevant datasets and corre- 
sponding attributes. Our model contains not only data modeling 
and algorithm metadata, but also the metadata generated during 
the data collection and preparation phases, including characteris- 
tics, statistical metadata, and relationships between datasets and 
between attributes. This information is essential to deal with the 
Feature Selection (FS) [25] problem. It also facilitates data anal- 
ysis tasks by allowing users to broaden their perspective on the 
data by viewing the results of performed analyses [1]. A detailed 
description of our model is discussed in the next section. 


3 A GENERIC MODEL OF 
ANALYSIS-ORIENTED METADATA 


The process of data analytics can be described by different phases:
business requirement, data exploration, data preparation,
modeling/algorithms and data product [11, 17]. In order to simplify data
analytics, we propose to manage the metadata which apply
mainly to the following phases: (i) Data exploration, which is
about collecting and exploring the data that users need. (ii) Data
preparation, which is about transforming, filtering and cleaning
data according to users' needs. (iii) Modeling/algorithms, which
consists in choosing a modeling technique, building a model and
evaluating the built model.

To facilitate the data exploration phase, metadata should help
users to find datasets that match the analysis objectives or
suggest relevant datasets to enrich the analysis. To facilitate the data
preparation phase, metadata should present the data
characteristics to help users to understand the nature of the data and to choose
the features for data modeling. Moreover, process characteristics,
process definition, and technical and content information of data
processes should be recorded to help other users understand how data
are prepared. To facilitate the data modeling phase, metadata should
present to users all the existing analyses and certain landmarker
results that have been carried out on the selected dataset, to help users
choose the most appropriate algorithms.

¹ https://CRAN.R-project.org/package=mfe

Figure 1: Metadata on datasets and attributes

To answer the above-mentioned requirements, we extend our
previous basic metadata model [20] with metadata that
capitalize on the full experience of data analytics. The complementary
metadata can be classified into three categories: datasets, attributes
and analytical metadata. In what follows, we give more details
about the different types of metadata in our model. Note that in this
paper, we only introduce the analysis-oriented metadata; a more
detailed metadata model is available on GitHub².


3.1 Metadata on Datasets 


Metadata on datasets allow users to have a general vision of a
dataset and can facilitate all of the data exploration, preparation and
modeling phases. They meet the requirements of data analysts who need
to know, among other things, (i) dataset structure / schema, which
presents the attributes in a dataset and their type; (ii) dataset
dimension, which is useful when a limit on data volume is required;
(iii) dataset completeness, because particular algorithms
can only be executed with no missing values; (iv) distribution of
attribute values, to fit the algorithms for which better performance
is only possible when the instances of the dataset respect certain
distributions; and (v) the relationship to other datasets, for instance,
similarity or dissimilarity measure values can help them to decide
whether or not to apply the same type of learning algorithm on
another dataset.

² https://github.com/yanzhao-irit/data-lake-metadata-management-system/blob/main/images/metadata_model_complet.png

Figure 2: Metadata on data analysis
To meet the above requirements, for each dataset, we provide
5 types of dataset metadata (see the white classes in Fig. 1):
(i) Schema metadata (marked in blue) concern the name and type
of attributes in an entity class. (ii) Dimension metadata (marked in
orange) concern the number of attributes, the number of instances and
the dimensionality (number of attributes divided by number of instances).
(iii) Missing value metadata (marked in green) relate to the number
and percentage of missing values as well as the number and
percentage of instances containing missing values. (iv) The distribution
of values is expressed by 34 metadata (marked in yellow), such as
attribute entropy, kurtosis of numeric attributes and standard
deviation of numeric attributes. (v) Relationships (marked in purple)
with other datasets refer to relationships that can be established
by users, such as similarity/dissimilarity and correlation; users can
define their own algorithms to calculate the different relationships.
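
To make the schema, dimension and missing value metadata concrete, the following minimal sketch computes them for a dataset held in memory. It is an illustration under stated assumptions rather than the application's actual code: the function name, the output key names, the representation of a dataset as a list of dictionaries, and the choice of treating `None` (or an absent key) as a missing value are all hypothetical. Only the dimensionality formula (number of attributes divided by number of instances) is taken from the text above.

```python
def dataset_metadata(rows):
    """Schema, dimension and missing-value metadata for a list of dict rows."""
    attributes = sorted({name for row in rows for name in row})
    n_instances, n_attributes = len(rows), len(attributes)

    def is_missing(row, attribute):
        # Assumption: a value is missing when it is None or the key is absent.
        return row.get(attribute) is None

    # Schema metadata: attribute name and the type of its first observed value.
    schema = {}
    for attribute in attributes:
        observed = [row[attribute] for row in rows if not is_missing(row, attribute)]
        schema[attribute] = type(observed[0]).__name__ if observed else "unknown"

    n_missing = sum(is_missing(row, a) for row in rows for a in attributes)
    n_incomplete = sum(any(is_missing(row, a) for a in attributes) for row in rows)
    n_cells = n_instances * n_attributes
    return {
        "schema": schema,
        "nb_attributes": n_attributes,
        "nb_instances": n_instances,
        "dimensionality": n_attributes / n_instances if n_instances else 0.0,
        "nb_missing_values": n_missing,
        "pct_missing_values": 100 * n_missing / n_cells if n_cells else 0.0,
        "nb_incomplete_instances": n_incomplete,
        "pct_incomplete_instances": 100 * n_incomplete / n_instances if n_instances else 0.0,
    }
```

The resulting dictionary could then be stored as the properties of a dataset node, whatever the storage technology.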


3.2 Metadata on Attributes 


Metadata on attributes give users detailed information on the
attributes of the chosen dataset, and these metadata can also
facilitate all of the data exploration, preparation and modeling phases.
We propose the attribute type, completeness, distribution of values and
attribute relationship metadata to answer the same user requirements at
the attribute level. For instance, when users analyze a dataset, they
may need to choose particular types of attributes for feature
engineering, they may need to choose attributes containing no missing
values for predictive models, or, when dimensionality reduction
approaches (e.g., Principal Component Analysis (PCA)) are employed,
they may need to know the correlation or covariance indicators.

Therefore, for each attribute, we provide the metadata (see the 
gray classes in Fig. 1) of (i) its type (classes NominalAttribute and 
NumericAttribute), (ii) the missing values (marked in red) which 
consist of the number of missing values, the number of non-missing 
values, the normalized missing values count and their percent- 
ages, (iii) the distribution of values (marked in pink), such as the 
entropy, distinct values, Kurtosis of the numerical attribute and 
whether the values of a nominal attribute follow a discrete uniform 
distribution, and (iv) the relationships between two different at- 
tributes (marked in light blue), for which we can devise our own 
relationships such as the Spearman’s rank correlation, Pearson’s 
correlation and mutual information. 
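
Some of the attribute-level metadata named above can be computed with a few lines of standard Python. The sketch below is illustrative only: the function names are hypothetical, entropy is computed in bits from the observed value frequencies, and the correlation functions assume two numeric attributes of equal length with no missing values and non-zero variance.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the observed value distribution."""
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in Counter(values).values())


def pearson(xs, ys):
    """Pearson's correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Each value could be stored on a node linking the two attributes and the relationship type, in the spirit of the relationships between attributes shown in Fig. 1.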


3.3 Metadata on Data Analysis 


As far as machine learning is concerned, especially supervised ML, 
metadata on datasets and attributes are useful but not sufficient. 
Indeed, the same dataset can be used for different analyses with 
different classes or features. Moreover, a user can also be interested 
in landmarker information which is obtained from performing basic 
learning algorithms. In addition, it is important to help users to 
obtain information on previous analyses, as existing analyses can 
be directly reused or contribute to improve current analyses. Thus, 
analytical metadata have to ensure collaborative capabilities. 
Therefore, to complete our metadata model, we add analytical
metadata (see the dark blue classes in Fig. 2) to help users find
and possibly reuse existing analyses with their business objectives,
selected features and target class, implemented parameters, output
model and evaluation of the result: (i) the study metadata
(marked in red), a study being a project containing different analyses,
(ii) the feature analysis (marked in purple), which concerns the
number of features by type, the relationships with the target attribute
and the distribution of all the feature values, (iii) the target attribute
analysis (marked in blue) in the case of supervised ML, (iv) the
implementation (marked in green) which is carried out for the
analyses, (v) the model of an analysis (marked in orange), and
(vi) the evaluation of the analysis (marked in yellow) with different
measures. Note that in this paper, we define the class attribute of
supervised analyses as the target attribute; all other attributes are
features.


4 IMPLEMENTATION AND USE CASE 
VALIDATION 


The proposed metadata can be used in decision support for data 
exploration, data preparation (including feature engineering) or the 
use of model/algorithms. The metadata facilitates (i) the choice of 
the most relevant datasets and their understanding as well as the 
feature engineering and (ii) the verification of previous analyses to 
better prepare and optimize the work. 

We chose to use a graph database (Neo4j) to store these metadata
for the following reasons: (i) good flexibility can be ensured, (ii) it is
easier to query a graph database when the search depth is important, and
(iii) some machine learning algorithms are integrated, which can
be useful to build a recommender system³. Note that the proposed
metadata model can also be stored with other technologies. For
instance, if a relational database is chosen, then for the following three
algorithms all the properties become attributes, all the nodes become
instances of the corresponding class table, and all the relationships
become instances of the relationship table.

This metadata database is integrated in the data lake system
and is managed through an application⁴. In the following
subsections, we explain how three algorithms feed metadata into a
graph database via an application. Then we present two use cases
that demonstrate the contribution of our model compared to other
models in the literature.


4.1 Metadata feeding 


The metadata that we introduced previously are managed by a 
locally deployed application that interfaces with user input and 
calls three algorithms to instantiate the Neo4J database according 
to our metadata model. 

In this application, users can input descriptive information, for 
instance, name and description of datasets or processes, chosen 
algorithms and set parameters, with an interface. The application 
can generate automatically other metadata, for instance, schematic 
metadata of datasets and relationships between different datasets 
to facilitate future analyses. 

During the data preparation phase, when a new dataset is stored in
the DL (by data ingestion or processing), the application automatically
generates descriptive metadata of the dataset and its attributes (see
Algo. 1).

During the data analysis phase, for each study, users can first
choose features and the target attribute to let the system calculate
statistical metadata of the features and run landmarkers for machine
learning guidance (see Algo. 2). With the information that the
system pre-analyzed, the user can implement different algorithms with
different parameter settings. For each algorithm with a parameter
setting, an Analysis and an Implementation are instantiated (see
Algo. 3).


³ https://neo4j.com/product/graph-data-science-library/
⁴ https://github.com/yanzhao-irit/data-lake-metadata-management-system
Algorithm 1: DatasetAttributesMetadata 

Input: connectionURL, descriptionDS 
Data: newDS 

/* use a table to store properties */ 
datalakeDatasetProperties ← createProperties('description', getDatasetName(newDS), 'connectionURL', connectionURL, 'size', getFileSize(newDS)) 
datalakeDataset ← createNode('DatalakeDataset', datalakeDatasetProperties) 
/* store schematic metadata */ 
datasetStruct ← getDatasetStructurality(connectionURL) 
if (datasetStruct.type = "structured" or "semi-structured") then 
    entities[] ← getEntityClasses(newDS) 
    foreach e ∈ entities do 
        atts[] ← getAttributes(e) 
        entityClassProperties ← createProperties('name', getEntityName(e), getEntityStatistics(atts[])) 
        // function getEntityStatistics() returns an array including the name and value of each statistical metadata 
        entityClass ← createNode('EntityClass', entityClassProperties) 
        createRelation('DatalakeDataset-EntityClass', datalakeDataset, entityClass) 
        foreach att ∈ atts[] do 
            if getAttType(att) = 'numeric' then 
                numericAttributeProperties ← createProperties('name', getAttName(att), getNumericAttStat(att)) 
                attribute ← createNodeNeo4j('NumericAttribute', numericAttributeProperties) 
            else 
                nominalAttributeProperties ← createProperties('name', getAttName(att), getNominalAttStat(att)) 
                attribute ← createNode('NominalAttribute', nominalAttributeProperties) 
            createRelation('EntityClass-Attribute', entityClass, attribute) 
        /* for each predefined RelationshipAtt we calculate the value of relationship between attributes */ 
        analysisAttributes[] ← getAnalysisAttribute(atts[], relationshipAtts[]) 
        foreach an ∈ analysisAttributes[] do 
            relationArr ← createNode('AnalysisAttribute', createProperties('value', an.value)) 
            createRelation('AnalysisAttribute-Attribute', relationArr, an.attribute1) 
            createRelation('AnalysisAttribute-Attribute', relationArr, an.attribute2) 
            createRelation('AnalysisAttribute-RelationshipAtt', relationArr, an.relationshipAtt) 
else 
    addAttToNeo4jNode(datalakeDataset, getDatasetFormat(newDS)) 
/* for each predefined RelationshipDS we calculate the value of relationship between datasets */ 
analysisDSRelationships[] ← getAnalysisDSRelation(datalakeDataset, datalakeDatasets[], relationshipDSs[]) 
foreach anDs ∈ analysisDSRelationships[] do 
    relationDs ← createNode('AnalysisDSRelationship', createProperties('value', anDs.value)) 
    createRelation('AnalysisDSRelationship-DatalakeDataset', relationDs, anDs.datalakeDataset1) 
    createRelation('AnalysisDSRelationship-DatalakeDataset', relationDs, anDs.datalakeDataset2) 
    createRelation('AnalysisDSRelationship-RelationshipDS', relationDs, anDs.relationshipDS) 
Algorithm 2: Pre_analysis 

Input: studyName, studyDesc, datalakeDataset, targetAtt, features[], analysisName, analysisDesc 
Data: selected dataset ds 

/* calculate statistics for analysis */ 
Function statisticsAnalysis(study, datalakeDataset, targetAtt, features[], analysisName, analysisDesc): 
    analysis ← createAnalysis('Analysis', createProperties('name', analysisName, 'description', analysisDesc)) 
    createRelation('Analysis-DatalakeDataset', analysis, datalakeDataset) 
    createRelation('Analysis-Study', analysis, study) 
    analysisTarget ← createNode('AnalysisTarget') 
    createRelation('Analysis-AnalysisTarget', analysis, analysisTarget) 
    createRelation('AnalysisTarget-Attribute', analysisTarget, targetAtt) 
    analysisFeatures ← createNode('AnalysisFeatures', createProperties(getStatAnalysisFeatures(features[]))) 
    createRelation('Analysis-AnalysisFeatures', analysis, analysisFeatures) 
    analysisNumericFeatures ← createNode('AnalysisNumericFeatures', createProperties(getStatAnalysisNumericFeatures(features[]))) 
    createRelation('AnalysisNumericFeatures-AnalysisFeatures', analysisNumericFeatures, analysisFeatures) 
    analysisNominalFeatures ← createNodeNeo4j('AnalysisNominalFeatures', createProperties(getStatAnalysisNominalFeatures(features[]))) 
    createRelation('AnalysisNominalFeatures-AnalysisFeatures', analysisNominalFeatures, analysisFeatures) 
    return analysis 

/* run predefined landmarkers */ 
Function runLandmarkers(study, datalakeDataset, targetAtt, features[]): 
    landmarkers[] ← getLandmarkers() 
    foreach lm ∈ landmarkers[] do 
        analysis ← statisticsAnalysis(study, datalakeDataset, targetAtt, features[], lm.name, lm.description) 
        output ← runAlgo(analysis, lm) 
        outputModel ← createNode('OutputModel', createProperties('name', , 'description', output)) 
        createRelation('Analysis-OutputModel', analysis, outputModel) 
        foreach m ∈ getEvaluationMeasures() do 
            modelEvaluation ← createNode('ModelEvaluation', createProperties('value', evaluate(m, analysis))) 
            createRelation('Analysis-ModelEvaluation', analysis, modelEvaluation) 
            createRelation('ModelEvaluation-EvaluationMeasure', modelEvaluation, m) 

study ← createNode('Study', createProperties('name', studyName, 'description', studyDesc)) 
runLandmarkers(study, datalakeDataset, targetAtt, features[]) 
Algorithm 3: Analyse_dataset 

Input: study, analysisName, analysisDesc, datalakeDataset, targetAtt, features[], sourceCode, algo, paras[], paraSets[] 
Data: selected dataset ds 

analysis ← statisticsAnalysis(study, datalakeDataset, targetAtt, features[], analysisName, analysisDesc) 
implementation ← createNode('Implementation', createProperties('sourceCode', sourceCode)) 
createRelation('Implementation-Analysis', implementation, analysis) 
algo ← createNode('Algorithm', createProperties('name', algo.name, 'description', algo.description)) 
createRelation('Implementation-Algorithm', implementation, algo) 
foreach para ∈ paras[] do 
    parameter ← createNode('Parameter', createProperties('name', para)) 
    createRelation('Implementation-Parameter', implementation, parameter) 
foreach paraSet ∈ paraSets[] do 
    parameterSetting ← createNode('ParameterSetting', createProperties('value', paraSet.value)) 
    createRelation('Implementation-ParameterSetting', implementation, parameterSetting) 
    createRelation('ParameterSetting-Parameter', parameterSetting, getParameter(paraSet.para)) 
output ← runAlgo(analysis, algo) 
outputModel ← createNode('OutputModel', createProperties('name', , 'description', output)) 
createRelation('Analysis-OutputModel', analysis, outputModel) 
foreach m ∈ getEvaluationMeasures() do 
    modelEvaluation ← createNode('ModelEvaluation', createProperties('value', evaluate(algo, analysis))) 
    createRelation('Analysis-ModelEvaluation', analysis, modelEvaluation) 
    createRelation('ModelEvaluation-EvaluationMeasure', modelEvaluation, m) 


4.2 Use Case Validation 


In this section, we present two examples based on real-life situations 
to show the utility of the stored metadata. 

As a running example, a data analyst in biology, named Bob, 
would like to identify the indicators that have an important impact 
on colon cancer. For this purpose, he uses the CHSI* dataset, 
which includes more than 200 health indicators for each of the 3,141 
United States counties. A data engineer constructs three extracts of 
this initial dataset to prepare analyses of three cancers (colon, breast 
and lung), each framed as a binary classification task (having or not 
having the cancer type). Indeed, because of the high number of 
indicators dedicated to multiple types of diseases (especially cancers), 
an analysis is difficult to perform directly on the single full dataset. 

Each extract has six indicators oriented to the analysis of colon, 
breast and lung cancer, respectively, covering people with measures 
of obesity, high blood pressure and smoking. Each extract has 3,141 
rows, and each row represents the group of people living in one 
county of a state of the United States. 


4.2.1 Use Case 1: Dataset similarity detection. Identifying the 
right ML model for a given dataset can be a very tedious task. One 
possible solution is to detect similar datasets on which models have 
already been executed. This makes it easier for a user to identify the 
type of algorithm that can be launched on their dataset. This approach 
is inspired by the automated ML field, where a model is automatically 
proposed from a set of existing datasets and ML workflows [7]. Thanks 
to the dataset and attribute metadata, it is easier to detect close 
datasets that are already stored, as long as a proximity measure has 
been implemented in the ML (Relationship metadata in Fig. 1). 


*https://healthdata.gov/dataset/community-health-status-indicators-chsi-combat-obesity-heart-disease-and-cancer 

In our running example, the dissimilarity measure proposed in 
[21] is applied (other measures could also be used; in different 
analysis contexts, users can choose different algorithms). This 
measure is defined on two levels. The first level estimates the 
dissimilarity between datasets (classes AnalysisDSRelationship and 
RelationshipDS) and the second level estimates the dissimilarity 
between attributes (classes AnalysisAttribute and RelationshipAtt). 

Bob wants to retrieve more datasets that are close to the colon 
cancer dataset that he is analyzing. He therefore uses the metadata 
management application to find the colon cancer dataset and goes to 
the Relationship Dataset tab to check the information he needs 
(see left image in Fig. 3). He finds that the breast cancer dataset 
and the lung cancer dataset are both similar to the colon cancer 
dataset (with dissimilarities <0.01). 

Thanks to this result, Bob decides to enrich his analysis with the 
breast and lung cancer datasets. Moreover, Bob can search all 
the analyses/algorithms that have already been executed on these 
datasets. For each analysis, he can get the name of the algorithm 
used, the parameter names, the parameter values and the evaluation 
values (see right image in Fig. 3). For instance, the result shown 
concerns an analysis of breast cancer, for which a random forest 
algorithm is performed with two parameters (n_estimators = 10, 
test_size = 0.2) and the result is evaluated by accuracy (0.42). 

Looking at these results, Bob decides to run Random Forest and 
SVM algorithms on his dataset because they give the best results in 
terms of accuracy on the breast (0.48) and lung cancer (0.58) 
analyses. 


4.2.2 Use Case 2: Machine learning guidance. Metadata based 
on landmarkers are a powerful way to estimate which kind of 
machine learning model will perform better on a dataset. By 
running a set of diverse simple models on the dataset, a baseline 
of model performances can be established. This baseline is a first 
indicator of the possible performance of more complex versions of 
each model, which reduces the need for the trial-and-error testing 
commonly associated with the choice of a machine learning model. 
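The landmarking idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two landmarkers used here (a majority-class predictor and a 1-nearest-neighbour classifier) are simple stand-ins rather than the landmarkers stored in the data lake, and the toy data, function names and holdout split are hypothetical.

```python
from collections import Counter


def majority_landmarker(train, test):
    """Predict the most frequent training label for every test row."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return [most_common for _ in test]


def one_nn_landmarker(train, test):
    """Predict the label of the closest training row (squared Euclidean distance)."""
    def nearest_label(x):
        closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
        return closest[1]
    return [nearest_label(x) for x, _ in test]


def error_rate(predictions, test):
    """Fraction of test rows whose prediction differs from the true label."""
    wrong = sum(1 for pred, (_, label) in zip(predictions, test) if pred != label)
    return wrong / len(test)


def run_landmarkers(train, test, landmarkers):
    """Return {landmarker name: error rate}, i.e. the baseline kept as metadata."""
    return {name: error_rate(fn(train, test), test) for name, fn in landmarkers.items()}


# Hypothetical toy data: each row is (features, label).
train = [((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0), ((3.0, 3.2), 1), ((3.1, 2.9), 1)]
test = [((1.1, 1.0), 0), ((2.9, 3.0), 1), ((3.2, 3.1), 1), ((0.8, 0.9), 0)]

baseline = run_landmarkers(train, test, {"Majority": majority_landmarker, "1NN": one_nn_landmarker})
best = min(baseline, key=baseline.get)  # landmarker with the lowest error rate
print(baseline, best)
```

The landmarker with the lowest error rate suggests which model family is worth refining first on the dataset.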

In our running example, four landmarkers (and derived versions 
using different parameters) are available in the Data Lake. These 
landmarkers include: Decision Tree, Naive Bayes, KNN and Random 
Forest (see class Landmarker in Fig. 2). They are evaluated with 
a single measure: error rate. 

Bob wishes to execute a predictive model on the colon cancer 
dataset. To guide him, he wants to know the error rate of the land- 
markers available in the data lake. In the application, he can check 
the existing landmarkers and evaluations one by one. For instance, 
the left image of Fig. 4 concerns the landmarker RandomTreeDepth3 
and its error rate is 0.4649; the right image of Fig. 4 concerns the 
landmarker REPTreeDepth3 and its error rate is 0.5070. 
Figure 4: Results for use case 2 


After checking all the landmarkers that have been executed on 
the colon cancer dataset, Bob finds that the lowest error rate is 
obtained by J48.0001. Therefore, Bob can use the landmarker J48, 
which gives him a basis on which to build a better model. 

The application offers two access modes. The first is a graphical 
interface dedicated to all users of the data lake (data scientists, 
statisticians and analysts) that helps them find, access and reuse 
existing datasets or analyses. For specialists with Neo4j skills, 
we provide a second access mode that supports richer searches over 
more characteristics. If Bob does not want to click on every 
landmarker to look for the lowest error rate, he can run an advanced 
search by writing a query himself. The application provides a free 
search function, which requires knowledge of Cypher (the Neo4j query 
language). Bob can use the Cypher query below to find all the 
landmarkers of the colon cancer dataset (left result in Fig. 5). 


MATCH 
  (ds)-[rhe:hasEntityClass]->(ec:EntityClass), 
  (a:Analysis)-[ra:analyze]->(ds), 
  (a)-[rhi:hasImplementation]->(lm:Landmarker), 
  (evl:ModelEvaluation)-[re:evaluateAnalysis]->(a), 
  (evl)-[ru:useEvaluationMeasure]->(m) 
WHERE 
  ds.name = 'Colon cancer' 
  AND m.name = 'Error rate' 
RETURN ds, ec, a, lm, evl, m, rhe, ra, rhi, re, ru 


Figure 5: Query and result of the free search 


By adding one condition on the query, Bob can get the lowest 
error rate landmarker directly (right result in Fig. 5). 


ORDER BY evl.value ASC LIMIT 1 
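For readers who want to issue the same search from a script, the sketch below assembles the full query text with the extra ordering clause appended. It is a hedged illustration: the function name and the use of `$dataset` and `$measure` query parameters are assumptions, not part of the application described here.

```python
# Cypher text from the paper, with the literal values replaced by parameters
# ($dataset, $measure). The parameterization is an assumption of this sketch.
BASE_QUERY = """
MATCH
  (ds)-[rhe:hasEntityClass]->(ec:EntityClass),
  (a:Analysis)-[ra:analyze]->(ds),
  (a)-[rhi:hasImplementation]->(lm:Landmarker),
  (evl:ModelEvaluation)-[re:evaluateAnalysis]->(a),
  (evl)-[ru:useEvaluationMeasure]->(m)
WHERE ds.name = $dataset AND m.name = $measure
RETURN ds, ec, a, lm, evl, m, rhe, ra, rhi, re, ru
""".strip()


def build_landmarker_query(dataset, measure, lowest_only=False):
    """Return (query text, parameters); optionally keep only the lowest value."""
    query = BASE_QUERY
    if lowest_only:
        query += "\nORDER BY evl.value ASC LIMIT 1"
    return query, {"dataset": dataset, "measure": measure}


query, params = build_landmarker_query("Colon cancer", "Error rate", lowest_only=True)
print(query)
print(params)
```

The resulting pair could then be handed to whichever Neo4j client is in use; passing values as parameters rather than concatenating them into the query text avoids quoting problems.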


4.3 Use case discussions 


Machine learning guidance and finding similar datasets are current 
topics in the literature, but they are tackled separately. In this 
section we introduced a solution that answers both needs through the 
use cases. Our model covers a complete range of metadata that can be 
used for different use cases. Recalling that the purpose of metadata 
is to prevent the data lake from turning into a data swamp, the time 
saved by having such a metadata system deserves discussion. Thanks to 
our metadata model, Bob could automatically access various datasets 
and their statistical information, as well as the algorithms already 
performed. In use case 1, Bob queried pre-calculated correlations 
during a feature selection task; in use case 2, the data lake brings 
up the most relevant algorithms for a new dataset. Without the data 
lake metadata, Bob would have had to manipulate the correlation 
functions himself or use specific tools or programs to obtain the 
landmarker results and to handle the similarity measures and the 
datasets. Thus, all of these tedious tasks are made transparent to 
users. Bob can define his own analysis by devising only a few 
queries on the data lake to start a machine learning process. This 
centralized metadata architecture facilitates the analysis work 
and saves time for Bob. 


5 CONCLUSION & PERSPECTIVES 


Without appropriate metadata management, a data lake can easily 
turn into a data swamp, which is invisible, incomprehensible and 
inaccessible. Different data lake metadata solutions have been 
proposed with different emphases; however, no solution focuses 
on analysis metadata that allow users to cross different datasets 
and analyses in order to take advantage of the existing dataset 
information and analysis experience in the data lake. 

In this paper, we propose a complete solution of analysis-oriented 
metadata for data lakes, which includes a metadata model, three 
algorithms for metadata detection and an application that allows 
users to search metadata easily. The metadata model contains not 
only descriptive information on the datasets (statistics on 
attribute values, management of missing values and relationships 
between datasets and attributes), but also analytical information on 
the analyses performed on these datasets (studies, tasks, 
implementations and evaluations of analyses, and performances and 
parameters of the previously executed algorithms). The algorithms 
explain how to automatically detect analysis-oriented metadata from 
different structural types of datasets and from performed landmarkers 
or analyses. The application allows users to find information on all 
the elements stored in the data lake in an ergonomic way. To validate 
our solution, we illustrate two use cases with real data to show 
the completeness, feasibility and utility of our solution. 

For our future work, we plan to include other metadata dedicated 
to additional types of analyses (other than ML algorithms), such as 
statistical or OLAP analyses. Moreover, a recommender system may 
suggest to users the most appropriate algorithms or parameters 
for different analyses. An explainability system can be considered 
to explain the results of machine learning analyses as well as 
other types of analyses. Finally, a large-scale user-oriented 
experiment using our metadata model will also be considered. 
ABSTRACT 


Over the years, several skyline query techniques have been 
introduced to handle incompleteness of data. The most recent of 
these sorts the points of a dataset into several distinct lists, 
one per dimension. The points are accessed through these lists in 
round-robin fashion, and the points that have not been dominated by 
the end compose the final skyline. That work is based on the 
assumption that relatively dominant points, if sorted, are processed 
first, and that even if such a point is not a skyline point, it 
prunes a large amount of data. However, that approach does not take 
into consideration that the dominance of a point depends not only on 
the highest value in a given dimension, but also on the number of 
complete dimensions the point has. Hence, we propose Priority-First 
Sort-Based Incomplete Data Skyline (PFSIDS), which uses a different 
indexing technique that allows access to be optimized based on both 
the number of complete dimensions a point has and the sorting of the 
data. 
1 INTRODUCTION 


The skyline query is a query that returns a “skyline”: a set of the 
most interesting, or best, points from the whole dataset [2][3][4][5]. 
The “best” points are the points that are not dominated by any 
other point, where a point dominates another point when it is not 
worse in any dimension and is better in at least one. Figure 1 shows 
a scatter plot of a two-dimensional dataset of points that represent 
hotel rooms with varied prices and distances to the beach. The 
points dominated by the skyline points are located inside the region 
delimited by the colored lines of their respective points. For 
example, A dominates D and E by having both a lower price and a lower 
distance to the beach; similarly, B dominates C, E, F, and J, etc. 
When points A, B, and G are compared, we notice that they don’t 
dominate each other: A has the lowest distance to the beach 
and hence cannot be dominated on this dimension, G has the 
lowest price, and B has a good combination of both, which prevents 
it from being dominated by A or G, since it has a better price than 
A and a better distance to the beach than G (see figure 1). 


Permission to make digital or hard copies of all or part of this work for personal or 
classroom use is granted without fee provided that copies are not made or distributed 
for profit or commercial advantage and that copies bear this notice and the full citation 
on the first page. Copyrights for components of this work owned by others than ACM 
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, 
to post on servers or to redistribute to lists, requires prior specific permission and/or a 
fee. Request permissions from permissions@acm.org. 

IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 

© 2021 Association for Computing Machinery. 

ACM ISBN 978-1-4503-8991-4/21/07...$15.00 
https://doi.org/10.1145/3472163.3472272 


IDEAS 2021: the 25th anniversary 


Figure 1: Skyline points a, i, and k in the illustrated example 
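The dominance relation on complete data can be made concrete with a minimal sketch. The (price, distance) values below are hypothetical, chosen only so that A, B and G are mutually non-dominated as in the discussion of the hotel example; lower values are better in both dimensions.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (here, lower values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the names of the points that no other point dominates."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# Hypothetical (price, distance to the beach) values for illustration only.
hotels = {
    "A": (120, 1),
    "B": (80, 3),
    "G": (50, 8),
    "C": (90, 4),
    "D": (130, 2),
    "E": (140, 5),
}

result = skyline(hotels)
print(sorted(result))
```

This quadratic pairwise check is the simplest possible baseline; it relies on every point having a value in every dimension.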


Most real-world data-driven applications have to deal with 
data that is incomplete due to factors such as sensor failures, 
signal obstruction, measurement errors, and others. These factors 
make it difficult to define skyline requirements over incomplete 
data [6][7][8]. The remainder of this work is organized as follows: 
section 2 reviews the Sort-based Incomplete Data Skyline algorithm, 
section 3 states our contributions, section 4 presents our proposed 
approach, section 5 reports the experiments, and section 6 gives our 
conclusions. 
2 SORT-BASED INCOMPLETE DATA SKYLINE ALGORITHM 


The Sort-based Incomplete Data Skyline algorithm (SIDS) creates d 
sorted lists, one for each dimension. Figure 2 shows an example of 
sorted lists based on a movie ratings dataset, where users 
are considered as dimensions. For example, the list for the first 
dimension, u1, consists of movies 5, 4, 1 and 7 in that exact order, 
since they are sorted based on their values in this dimension 
only. The algorithm then chooses one of the lists as a starting point 
and visits the lists in a round-robin fashion. Each point of the 
current list is compared to the other points, pruning away all of the 
dominated tuples. Then, the algorithm moves to another list. If the 
algorithm reaches the last list, the pointer to the current position 
is increased, and the current list is changed to the first one. If a 
tuple has been processed k times, where k is its number of complete 
dimensions, and was not dominated by any other point, then it is 
determined to be a skyline point. Such an approach has two distinct 
disadvantages: 1) It doesn’t take into account a distinctive feature 
of incomplete data: the relative power of a point, in other words the 
overall number of points it may dominate, depends not only on the 
individual values in any given dimension but also on the number of 
complete dimensions it has. 2) Due to the nature of the algorithm, 
adding new data is very costly, as it requires rebuilding and 
re-sorting all of the lists. 


Figure 2: Sorted lists in SIDS 


3 CONTRIBUTIONS 


3.1 Effects of the number of complete 
dimensions on performance 


The primary contribution of this work is the exploration of the 
performance improvement observed when points with a low number of 
complete dimensions are prioritized. To the best of our 
knowledge, this notion has not been discussed before. 


3.2 Addressing the limitation of related works 


In this work we explain and improve on some limitations present 
in related works. 


3.3 Skyline Query in Incomplete Data 


Another contribution of this work is the development of an 
algorithm that uses both the values in a given dimension and the 
number of complete dimensions a point has in order to improve the 
overall performance of skyline queries on incomplete data. 
4 PROPOSED APPROACH 


Priority-First Sort-based Incomplete Data Skyline (PFSIDS) utilizes 
a different indexing technique that allows access to be optimized 
based on both the number of complete dimensions a point has and 
the sorting of the data. This approach has demonstrated evaluation 
efficiency and scalability on both synthetic and real-world datasets. 


4.1 Preliminaries 


We assume that dataset D is a d-dimensional dataset with 
dimensions d = {d1, d2, d3, d4, ..., dn}. Each point p ∈ D is 
represented by a set of values p = (p1, p2, p3, p4, ..., pn) (see 
figure 3). The incompleteness of data is represented by missing 
values in the dimensions of the dataset, where a missing value is 
denoted as -. For example, a point p in a five-dimensional dataset 
with values a, b, c, d in the first four dimensions and a missing 
value in the last dimension would be represented as (a, b, c, d, -). 
Without loss of generality, this work assumes that all dimensions of 
the dataset have a total order in which greater values are considered 
to be better. 
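The representation above can be sketched as follows, with `None` standing in for the missing-value symbol. The dominance test shown is an assumption of this sketch: it compares two points only on the dimensions where both have a value, which is a common convention for incomplete data but is not spelled out in this section. Function names are hypothetical.

```python
def complete_dims(p):
    """Number of non-missing dimensions of a point (None marks a missing value)."""
    return sum(v is not None for v in p)


def dominates(p, q):
    """Assumed convention: compare p and q only on dimensions present in both.
    p dominates q if it is no worse on all of them and strictly better on one
    (greater values are better)."""
    shared = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    return (
        bool(shared)
        and all(a >= b for a, b in shared)
        and any(a > b for a, b in shared)
    )


# A five-dimensional point with a missing value in the last dimension.
p = (4, 3, 7, 2, None)
q = (2, 3, None, 1, 9)

print(complete_dims(p), complete_dims(q))
print(dominates(p, q), dominates(q, p))
```

Here p and q share the first, second and fourth dimensions; p is at least as good on all three and strictly better on two, so p dominates q under this convention.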


4.1.1 Creating Index. This step converts the dataset into an index 
that will be used for processing the skyline. The pseudo-code 
is shown in Algorithm 1. In lines 1-2 a list is created for each 
distinct cumulative number of non-missing dimensions observed 
in the dataset. For example, the sample dataset contains points 
with cumulative numbers of complete dimensions of 1, 2, 3, 4 and 5. 
Then, in line 3, all of the points with the corresponding number of 
complete dimensions are assigned to the variable listPoints. 
Finally, in lines 4-9 all of the points in listPoints are sorted 
into several arrays, one per dimension (see Algorithm 1). 

All the lists are arranged in increasing order of the number of 
complete dimensions, which represents the “priority” of access to the 
points since, as mentioned previously, the fewer complete dimensions 
a point has, the more points it would subsequently prune (see tables 
2(a), 2(b), 2(c), 2(d) and 2(e) in figure 4). 


Algorithm 1 creates the lists. It first receives a dataset, denoted 
D. The outer loop iterates over the different cumulative numbers of 
complete dimensions found among the points. For each such number i, 
we create a list and initialize listPoints with the points whose 
number of complete dimensions equals the loop counter i. The inner 
loop then iterates over the dimensions of D: for each dimension we 
create an array, sort all points belonging to listPoints on that 
dimension and store them in the array, which is inserted into the 
list. When both loops complete, we obtain all lists arranged in 
ascending order of the number of complete dimensions. 
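The index construction can be sketched in Python as below. This is an illustration under assumptions rather than the authors' code: points are stored in a dictionary keyed by a hypothetical identifier, `None` marks a missing value, each array is sorted in descending order (because greater values are assumed to be better), and points lacking a value in a dimension are left out of that dimension's array.

```python
from collections import defaultdict


def create_index(dataset, num_dims):
    """Group points by their number of complete dimensions, then build one
    sorted array of point ids per dimension inside each group.
    Returns a list of (k, arrays) pairs in ascending order of k."""
    groups = defaultdict(list)
    for pid, values in dataset.items():
        k = sum(v is not None for v in values)
        groups[k].append(pid)

    index = []
    for k in sorted(groups):  # ascending k gives the access priority
        arrays = []
        for d in range(num_dims):
            present = [pid for pid in groups[k] if dataset[pid][d] is not None]
            # Best (largest) value first; an assumption of this sketch.
            arrays.append(sorted(present, key=lambda pid: dataset[pid][d], reverse=True))
        index.append((k, arrays))
    return index


# Hypothetical three-dimensional sample.
dataset = {
    "p1": (5, None, None),
    "p2": (None, 3, None),
    "p3": (4, 2, None),
    "p4": (1, None, 7),
    "p5": (3, 3, 3),
}

index = create_index(dataset, num_dims=3)
for k, arrays in index:
    print(k, arrays)
```

Each point appears once per complete dimension inside the list for its group, which is why a per-point counter is later needed to know when the point has been fully processed.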


4.1.2 Deriving Skylines. This stage takes the index created in 
the previous stage and produces the skyline of the dataset [9][10][11]. 
Algorithm 2 represents the pseudo-code of the procedure. First, in 
line 1 we initialize CandidateSet to be equal to the whole dataset 
and the initial Skyline to be empty: at the beginning we have to 
consider the whole dataset as a potential skyline, and the skyline 
itself should be empty because no points have been processed yet. 
Since points with multiple complete dimensions are encountered 
repeatedly inside the arrays of the index, a structure has to be 
introduced in order to decrease redundancy. Lines 2-5 represent the 
initialization of the processedCount and dimCount variables for each 
point of the dataset. The first one keeps track of how many times a 
point has been processed and the second one represents the cumulative 
number of complete dimensions a point has. Thus, if a point has been 
processed as many times as the number of complete dimensions it has, 
and has not been dominated by any other point, it can be considered 
part of the skyline. Then, the index is processed in a round-robin 
fashion: the lists, each corresponding to a number of complete 
dimensions, are accessed in order of priority, and the points are 
processed iteratively based on the position pointer. When the last 
dimension of a given list is processed, the algorithm continues to 
the next list. After the last list is processed, the position pointer 
is increased by 1. If the value is missing in an array of the index, 
the algorithm simply skips the dimension. If we take the index from 
Figure 4 as an example, first the list corresponding to points with 1 
complete dimension is accessed and the position pointer is equal to 0. 
Then, the points p2, p17, p18, p14, p10 are processed. After that, the 
procedure continues to the next list, where p6 is processed twice and 
p4 once. When the last list is finished, the position pointer is 
increased by one, and the operation advances in a similar fashion 
until the CandidateSet of points is empty, meaning that all of the 
points inside the CandidateSet have been either pruned away or 
promoted into the Skyline. 


Figure 3: Sample of Dataset 


Figure 4: An index example of the sample dataset (Tables 2(a) to 2(e)) 


Algorithm 1: Creating lists 

Input: dataset D 
Output: lists l that contain sorted arrays for each dimension ra_i 
for each different cumulative number of complete dimensions across all of the points i do 
    create a list l_i 
    initialize listPoints = p ∈ D where |p| == i 
    for each dimension d ∈ D do 
        create an array ra_i 
        sort all points p ∈ listPoints based on d 
        store points in ra_i 
        insert ra_i into l_i 
    end 
end 
sort all lists l in an ascending order of i 
return l 


Figure 5: Algorithm 1 


Initially, the flag variable isDominated of each point is set to 
False, as indicated in line 11. Then, if the processedCount of the 
point is equal to 0, the point is compared against all other points in 
CandidateSet. If points inside the CandidateSet are dominated, they 
are removed from the set. On the other hand, if the point is 
dominated by any of the candidates, its flag variable isDominated 
is set to True. The point isn’t removed right away because, while it 
is dominated, the point itself can still prune other points. At the 
end of the exhaustive comparisons, line 21 checks whether p is both 
inside the CandidateSet and isDominated, and removes it if that is 
true. Lines 25-28 increase the processedCount by 1, and if the 
variable reaches the cumulative number of complete dimensions 
dimCount of the point while the point is still inside the 
CandidateSet, the point is moved from the set into Skyline (see 
figure 6). 


In Algorithm 2, the lists generated by Algorithm 1 are consumed. 
Initially, CandidateSet is set to the complete dataset, Skyline is 
empty, and an iteration counter is used for the loop. Inside the loop 
we take the point at the current position and set its flag variable 
to False. We then check dominance: candidates dominated by the point 
are removed from CandidateSet, and if the point is itself dominated 
it is flagged and removed from CandidateSet once the comparisons are 
finished. The current state is evaluated against the next one in 
order to avoid redundancy. The process is repeated, and the dominated 
state (true or false) indicates which elements must be moved to the 
Skyline. 


4.2 CandidateSet data structure 


It is important to note that throughout the runtime of the algorithm 
CandidateSet is accessed very frequently for both lookups and 
removals of the candidates themselves [12][13][14]. It is therefore 
very important to select an appropriate data structure so that the 
algorithm is not bottlenecked by it. A hash table was chosen, as it 
has an amortized performance of O(1) for both search and deletion. 
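As a small illustration of this choice, Python's built-in `set` is a hash-based container with average constant-time membership tests and removals. The helper name and point identifiers below are hypothetical.

```python
# A hash-based candidate set keyed by point id (ids are hypothetical).
candidate_set = {"p1", "p2", "p3", "p4"}


def prune(candidates, point_id):
    """Remove a dominated point; return True if it was still a candidate."""
    if point_id in candidates:       # average O(1) lookup
        candidates.remove(point_id)  # average O(1) removal
        return True
    return False


first = prune(candidate_set, "p2")   # p2 is present and gets removed
second = prune(candidate_set, "p2")  # p2 is already gone
print(first, second, sorted(candidate_set))
```

A list would need a linear scan for each lookup or removal, which matters because the candidate set is touched inside the innermost comparison loop.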


4.3 Redundant comparison minimization 


Since all of the points in the dataset have to be processed because 
of cyclic dominance [15][16][17], it is critical for the performance 
of the algorithm to avoid comparisons that have already been done. 
Previous approaches proposed using timestamps, where every point 
has a corresponding variable that holds the time when it was last 
processed. However, to reduce the amount of memory used, we propose 
to utilize the index that has already been created. The idea behind 
it is very simple: if a point is located in a previous list at an 
equal or higher position, the points have already been compared 
[18][19][20][21]. 


4.4 Discussion on time and space complexity 


The worst-case time complexity of PFSIDS is O(dn² + dn log n), which
results in a final O(n²). The same worst-case time complexity is
observed in all skyline query techniques for incomplete data, since a
query that fails to prune away points has to perform exhaustive
pairwise comparisons.
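The exhaustive pairwise fallback can be sketched as follows. This is
not PFSIDS itself but a naive baseline, written under two assumptions
that this section does not state: smaller values are preferred, and a
missing value is encoded as NaN. Dominance is evaluated only on the
dimensions that are complete in both points, which is the usual
definition for incomplete data.

```python
import math


def dominates(p, q):
    """Return True if p dominates q on the dimensions complete in both.

    Assumptions: smaller values are better, missing values are NaN.
    """
    better = False
    for a, b in zip(p, q):
        if math.isnan(a) or math.isnan(b):
            continue            # skip dimensions missing in either point
        if a > b:
            return False        # p is worse on a shared dimension
        if a < b:
            better = True
    return better


def naive_skyline(points):
    """Exhaustive pairwise check: up to n * (n - 1) dominance tests."""
    comparisons = 0
    result = []
    for i, p in enumerate(points):
        dominated = False
        for j, q in enumerate(points):
            if i == j:
                continue
            comparisons += 1
            if dominates(q, p):
                dominated = True
                break
        if not dominated:
            result.append(p)
    return result, comparisons
```

Each dominance test costs O(d), so the nested loops give the O(dn²)
term when no point can be pruned early.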


The best case is O(dn + dn log n), which results in O(n log n). This
best-case time complexity is worse than that of the naive approach,
since the naive approach does not sort its points in any way. However,
it is highly unlikely to have anywhere near this performance in a
real-world scenario.
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Algorithm 2: Skyline Query Processing 


Input: lists l that contain sorted arrays for each dimension arr_d
Output: Skyline set S
initialize CandidateSet = D, Skyline = ∅, iteration = 0
for each point p ∈ CandidateSet do
    initialize processedCount(p) = 0
    initialize dimCount(p) = number of complete dimensions in p
end
while CandidateSet ≠ ∅ do
    for each list l_i do
        for each complete dimension d ∈ l_i do
            p = arr_d[position]
            isDominated = False
            if processedCount(p) = 0 then
                for each candidate c ∈ CandidateSet do
                    if p dominates c then
                        remove c from CandidateSet
                    else if c dominates p then
                        isDominated = True
                    end
                end
            end
            if p ∈ CandidateSet and isDominated then
                remove p from CandidateSet
            end
        end
        processedCount(p) = processedCount(p) + 1
        if p ∈ CandidateSet and processedCount(p) = dimCount(p) then
            move p from CandidateSet to Skyline
        end
        iteration = iteration + 1
    end
end
return Skyline


Figure 6: Algorithm 2 


Calculating the average case is very tricky, as it would largely 
depend on the effectiveness of preprocessing. In our experience, 
datasets with anti-correlated and random distributions would ex- 
hibit time complexity very close to O(n log n). 


The memory required by SIDS[1] is higher because, in addition to the
dataset and preprocessed data, it also has to store the timestamps
required for minimization of redundant comparisons, whereas our
approach utilizes the preprocessed data and therefore has no need
for timestamps.


5 EXPERIMENTS 


Experiments have been carried out on both real-world and synthetic
datasets for a more precise evaluation of the performance. The
real-world datasets are the NBA dataset, as an example of correlated
distribution, and the New York City Airbnb Open dataset (see figures
7, 8, 9, and 10), since hotel data can be seen as an example of
anti-correlated distribution. The synthetic dataset is generated
with the NumPy Python library using a random distribution.
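A synthetic incomplete dataset of this kind can be sketched as
follows. The experiments use NumPy; this sketch uses only the Python
standard library so that it runs without dependencies. The function
names, the NaN encoding of missing values, and the rule that every
point keeps at least one complete dimension are assumptions made for
illustration only.

```python
import math
import random


def make_incomplete_dataset(n_points, n_dims, missing_rate, seed):
    """Uniform random points with about `missing_rate` of the cells missing."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        for d in range(n_dims):
            if rng.random() < missing_rate:
                point[d] = math.nan
        # Assumption: a point with no complete dimension is not useful,
        # so one randomly chosen dimension is restored in that case.
        if all(math.isnan(v) for v in point):
            point[rng.randrange(n_dims)] = rng.random()
        data.append(point)
    return data


def missing_fraction(data):
    """Fraction of cells that are missing (NaN)."""
    cells = [v for point in data for v in point]
    return sum(math.isnan(v) for v in cells) / len(cells)


dataset = make_incomplete_dataset(n_points=1000, n_dims=6,
                                  missing_rate=0.2, seed=0)
```

Fixing the seed makes a run reproducible, which is what allows results
to be averaged over a set of different seeds.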


NBA is a dataset that contains aggregate individual player sta- 
tistics from 67 regular seasons. The dataset represents almost 4000 
players, with 47 recorded over the years. The dataset is positively 
correlated, which means that people with high skills in some fields 
tend to be also highly skilled in other areas. The missing rate of the
dataset is 23 percent.


New York City Airbnb Open dataset contains a summary of in- 
formation and metrics for listings in New York City. It contains 
47906 tuples with 6 dimensions that recorded price, number of rat- 
ings, etc. The missing rate of the dataset is 11 percent. 


Several synthetic datasets have been utilized for extensive testing.
They vary in size, number of dimensions, and missing rate. For
fairness, experiments on these datasets have been carried out with
20 different seeds, after which the results were averaged.


Processing time and the number of comparisons have been adopted as
metrics for the evaluation of scalability, as well as of the effect
that dimensionality and missing data have on the performance. The
standard missing rate for the evaluation of scalability and
dimensionality is 20 percent, the same missing rate that was used in
previous works. The preprocessing time required by the algorithms,
unless specified otherwise, has been included in the results.


5.1 Scalability 


Figure 9 shows that PFSIDS consistently reduces the overall number of
comparisons required for determining the final skyline. The reason is
that the algorithm prioritizes the points that are likely to prune
the largest number of points. The reduction in the number of
comparisons in figures 8, 10, and 12 leads to an improvement in
processing time, as shown in figures 7, 9, and 11. The small
difference in the number of comparisons for the NBA dataset is caused
by the correlated distribution of the data, where the more dominant
points are located at the top of the lists.
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Figure 7: NBA — Processing Time 
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Figure 8: NBA-Number of Comparisons 
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Figure 12: Synthetic - Comparisons 


5.2 Dimensionality 


As shown in figures 13, 14, 15, 16, 17, and 18, PFSIDS outperforms
all other approaches. Here, similar to the scalability benchmark, we
can see that PFSIDS outperforms SIDS across the board. The NBA dataset
behaves in a similar fashion to the one we see in figures 13 and 14.
For the NYC dataset, however, the two approaches perform similarly up
until we reach 4 dimensions. Such behavior is caused by the fact that
not all of the dimensions of the NYC dataset are anti-correlated. The
first four dimensions are correlated with each other, and when the
anti-correlated dimensions are processed in later stages, the
performance levels out to one similar to that observed in figures 15
and 16.
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Figure 13: NBA — Processing Time 


5.3 Missing rate 


Since all of the non-synthetic datasets have a static missing rate,
23 percent for NBA and 9 percent for NY Airbnb, only synthetic data
has been used for this benchmark, as its missing rate can be easily
adjusted. Figures 19 and 20 show that the difference in the
processing time of PFSIDS and SIDS becomes rather negligible with a
missing rate greater than 70 percent. That is because at that point
the construction of the index starts to dominate the performance of
the algorithm, while SIDS starts accessing more desirable points with
a small number of dimensions earlier because of the high missing rate
of the data.
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Figure 16: NYC —- Number of comparisons 
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Figure 18: Synthetic - Number of comparisons 
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Figure 19: Synthetic — Processing time 
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Figure 20: Synthetic - Number of comparisons 


5.4 Pre-processing 


As mentioned before, one disadvantage of our approach is that it
requires pre-processing of the data. While pre-processing the data
before the query is feasible in most real-world applications, it may
still pose a concern with larger datasets, where it starts to
dominate the performance of the query. We have therefore performed
several tests to evaluate how pre-processing scales with different
amounts of data, numbers of dimensions, and missing rates, as well as
how it affects the performance of the whole algorithm. As can be
observed in figures 21, 22, and 23, the number of dimensions and the
missing rate have no observable impact on the performance of PFSIDS,
whereas the time it takes to create the index grows linearly with the
size of the dataset.
Figure 22: Dimensionality
Figure 23: Missing rate


5.5 Meaningful Skylines 


We have decided to remove the points with a small number of complete
dimensions, which would theoretically give us more meaningful
skylines, as such points may not be very desirable as a result. The
x-axis of figures 24 and 25 represents the threshold: points with
that many complete dimensions or fewer are removed from the dataset.
We can observe that the removal of points with 1 complete dimension
has a considerable effect on the processing time and number of
comparisons for PFSIDS, after which it levels out. The SIDS algorithm
seems to be unaffected by the removal because it does not depend on
the points with a low number of complete dimensions. Nevertheless,
PFSIDS exhibits a better performance overall.


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada
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Figure 24: Effect of removal of points with low number of 
complete dimensions 
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6 CONCLUSIONS 


In this work, we have proposed a new approach for processing sky- 
line queries in incomplete data, PFSIDS. We utilize the observation 
that the relative dominance of a point depends not only on the indi- 
vidual values the points have in a given dimension but also on the 
number of complete dimensions a point has. In PFSIDS, we create 
a specialized index that allows us to prioritize points in the dataset 
that are more likely to dominate a large number of points, which 
results in performance improvement. By comparing our approach 
with existing works, we observed that PFSIDS consistently reduces 
the number of overall comparisons which shortens the processing 
time considerably. The experimental results show our approach 
can achieve a performance improvement of up to several orders of 
magnitude for missing rates less than 70 percent. 
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ABSTRACT 


Adults with autism face many difficulties when finding employment,
such as struggling with interviews and needing accommodating
environments for sensory issues. Autistic adults, however, also have
unique skills to contribute to the workplace that companies have
recently started to seek, such as loyalty, close attention to detail,
and trustworthiness. To work around these difficulties and help
companies find the talent they are looking for, we have developed a
job-matching system. Our system is based around the
stable matching of the Gale-Shapley algorithm to match autistic 
adults with employers after estimating how both adults with autism 
and employers would rank the other group. The system also uses 
filtering to approximate a stable matching even with a changing 
pool of users and employers, meaning the results are resistant to 
change as the result of competition. Such a system would be of ben- 
efit to both adults with autism and employers and would advance 
knowledge in recommender systems that match two parties.

• Information systems → Information retrieval; • Retrieval tasks and
goals → Recommender systems.
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1 INTRODUCTION 


Adults with autism are among the most underemployed demographics,
with a recent report claiming that 85% of them are unemployed [15].
This statistic, however, is not due to an inherent lack of ability
of adults with autism, as shown by the fact that some intervention
can improve rates of employment. Assisting adults with autism in
finding employment provides benefits both to the individual and to
society. For the individual, gainful employment leads to financial
independence, which in turn leads to increased opportunities for
adults with autism. Meaningful employment also leads to increased
self-esteem and general well-being, even leading to increased
cognitive ability [9]. For society, increased independence also
results in less social expenditure, and employment increases tax
revenue [10]. More importantly, although individuals with autism have
special skills to contribute to the workplace if applied to the
right job, this talent is currently not being utilized. It is for
this reason that there is an urgent need to find a solution to this
problem.

One technique that has proven to be successful in improving rates of
employment and retention is job matching [4]. Resources for job
matching for adults with autism, however, are limited. The problem
that we tackle is to develop an algorithm that provides an automated
job-matching system for adults with autism and potential employers
who are interested in hiring them. Existing programs that utilize job
matching for adults with autism include the Swedish corporation
Samhall (samhall.se) and the American corporation Daivergent
(daivergent.com). Samhall’s system matches clients’ abilities with
employers’ demands using a scheme that measures 25 different traits
(covering sensory function, intellectual ability, mental ability,
social ability, and physical ability) on 3 levels (limited, good, and
high ability on the client side, corresponding to low, medium, and
high requirements on the employer side) [16]. Daivergent uses
artificial intelligence to match vetted candidates with jobs, which
are extracted from descriptions using machine learning [3].
Unfortunately, there is little public information about how these
corporations perform their matching beyond the details they choose
to share, which both limits their services to their own clientele
and restricts the research contributions that could come from
studying their systems.

To solve this problem, we develop a job-matching system so that both
users, i.e., adults with autism who are looking for work, and
employers can autonomously create profiles and then be automatically
matched. As in Samhall’s system, this matching is done by quantifying
the skills of employees and the demands of work on multiple axes. The
goal is to pick a match that minimizes the discrepancy between the
user’s skills and the employer’s demands, as this minimizes both the
number of skills the user would need to develop and the
accommodations that the employer would have to make. In addition to
the job tasks themselves, demands of the application process, such
as required interview skills, are also included as part of the work
demands.
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Our job-matching system takes into account not only the skills 
of the user, but also their interests. This not only respects the
desires of the individual, but also leads to much greater
productivity [13]. As we assume that the user’s preference is based
principally on their interests, while an employer’s preference is
based principally on the ability of a worker to perform the task
effectively according to their skills, interest aspects are measured
separately from skill aspects. The fact that the user’s preference
and the employer’s preference can differ significantly necessitates
a system for finding a compromise between the two. From the interest
aspects, an automated ranking of employers is generated for each
user, and from the skill aspects, an automated ranking of users is
generated for each employer. These different rankings are combined
into a single matching using the Gale-Shapley stable-matching
algorithm [6], which finds a stable matching. Stable means that no
user and employer would both rank each other higher than the
partners they were paired with in the matching.
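For reference, the basic user-proposing Gale-Shapley procedure can be
sketched as below. This is the textbook algorithm, not the extended
version proposed in this paper, and it assumes equally many users and
employers with complete preference lists. The names and preference
lists in the example are hypothetical.

```python
def gale_shapley(user_prefs, employer_prefs):
    """User-proposing Gale-Shapley; returns a dict mapping user -> employer.

    Assumes equally many users and employers and complete preference lists.
    """
    # rank[e][u] is the position of user u in employer e's list (lower is better).
    rank = {e: {u: i for i, u in enumerate(prefs)}
            for e, prefs in employer_prefs.items()}
    next_choice = {u: 0 for u in user_prefs}   # next employer each user proposes to
    engaged = {}                               # employer -> tentatively accepted user
    free = list(user_prefs)
    while free:
        u = free.pop()
        e = user_prefs[u][next_choice[u]]
        next_choice[u] += 1
        current = engaged.get(e)
        if current is None:
            engaged[e] = u                     # employer was unmatched
        elif rank[e][u] < rank[e][current]:
            engaged[e] = u                     # employer prefers the new proposer
            free.append(current)
        else:
            free.append(u)                     # proposal rejected, try next choice
    return {u: e for e, u in engaged.items()}


# Hypothetical preference lists, best choice first.
user_prefs = {
    "Alice": ["David", "Eve", "Fred"],
    "Bob": ["David", "Fred", "Eve"],
    "Carol": ["Eve", "David", "Fred"],
}
employer_prefs = {
    "David": ["Bob", "Alice", "Carol"],
    "Eve": ["Alice", "Carol", "Bob"],
    "Fred": ["Carol", "Alice", "Bob"],
}
matching = gale_shapley(user_prefs, employer_prefs)
```

Because the users propose, the resulting stable matching is the one
that is optimal for the users among all stable matchings.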

The proposed job-matching system advances both technology for
assisting adults with autism in finding jobs and the knowledge of
matching systems in general. The fact that users act autonomously in
this system means that the labor costs associated with current
job-matching systems can be reduced. Positions are also extremely
limited in existing job-matching systems that are geared towards
adults with autism, so this system acts as a potential starting point
for increasing access for many more adults with autism. It can
potentially be extended to serve other populations as well. Moreover,
our job-matching system differs from existing systems in that it is
open to anyone who wishes to use it, with no manual vetting required.
It is also open source, so it may be built upon for further research.


2 RELATED WORK 


While numerous companies use their own job-matching algorithms,
academic research on the subject exists as a specific application in
the broader field of recommendation algorithms. CASPER [14], one of
the proposed job-recommendation algorithms, focuses on clustering
users based on their activities while reviewing jobs so that
collaborative filtering may be applied. Malinowski et al. [12], on
the other hand, use a content-based filtering approach based on
profiles that are manually entered by users. Much of the latest
research on job matching relates to processing data from resumes and
other sources so that it can be used. In particular, Resumatcher,
introduced by Guo et al. [8], seeks to match similar profiles by
extracting data from unstructured resumes and job descriptions. Other
works that focus on comparing unstructured data so that similar
profiles may be matched include the self-reinforcing model proposed
by Koh et al. [11] and the collective-learning approach developed by
Cing [2].

Even though research on algorithms for matching autistic adults with
employers is minimal, there is substantial work on the subject of
autistic employment in general. Of particular interest is the work of
Dreaver et al. [4], who tackle the problem from an employer’s
perspective. In addition to suggesting matching, they emphasize the
importance of external supports and of employers understanding
autism. Indeed, most research works focus on the perspective of the
adults, with numerous studies supporting the effectiveness of
Behavior Skills Training (BST), especially when combined with
prompting and audio cues. Grob et al. [7] managed to achieve a 100%
success rate at teaching skills using BST combined with prompting,
while Burke et al. [1] found that BST combined with audio cues is six
times as effective as BST by itself.

We observe that although existing job-recommender algorithms cannot
assign multiple users to the same job, existing job-matching
strategies can. Moreover, existing job-matching approaches attempt to
automate the recruitment stage in the hiring process [2], which aims
to make suggestions to many interested job seekers instead of
selecting one. However, this process has failed to adequately serve
the autistic population, and for this reason we seek to work around
it. In addition to being specifically tailored to the autistic
population, our job-matching algorithm differs from existing ones in
that it relaxes the restriction on the one-to-one matching between
users and employers so that those who perform poorly in the existing
system can still have unique jobs suggested to them. Instead of
looking at the similarity between profiles in a single vector space,
we consider the similarity in the aptitude and interest vector spaces
separately and use this information to define a stable matching.


3 OUR JOB-MATCHING ALGORITHM 


The central part of our job-matching algorithm is an extension of
the Gale-Shapley stable-matching algorithm [6] so that it may be
applied in cases where the basic algorithm cannot be. While our
algorithm is not the first extension of the Gale-Shapley algorithm,
it is the first to use the algorithm to generate a ranking rather
than a single match. While such a ranking is not useful or even
meaningful in all applications of the Gale-Shapley algorithm, it
works well with the assumptions made in this problem of matching
adults with autism with potential employers. The idea of generating a
ranking makes sense in this context, since it is based on the
assumption that the information the model has is mostly accurate but
may be incomplete with respect to the preferences of the user making
their choice. Users then supply the missing information by selecting
their preferred employer from the ranking. We can assume that
getting this information after the ranking is done does not violate
the integrity of the results because the algorithm is strategy-proof
from the perspective of the users [5]. This means it is in the best
interest of the users to state their interests as accurately as they
can. Our algorithm could also potentially be applied to other
problems where the objective is to approximate a stable solution in a
two-sided market. Technologies that face such problems range from
dating apps to tools for analyzing various financial markets. This
novel ranking idea may help fill in information that was missed
during other parts of an automated process when tackling these
problems.


3.1 Server and Client Specifications 


Records containing user profiles and employer profiles, as well as
additional information associated with them, are stored on the
server. The record for a particular user contains the username, the
password, a text description detailing whatever the user is offering,
a binary flag specifying whether they are an employer, a sequence of
double-precision floats representing the user’s profile vector, and a
flag specifying whether the user has finished creating their account
(see Figure 1 for an example). Similarly, an employer profile
contains the corresponding information about an employer. It is on
this server that our job-matching algorithm is performed. The server
is designed so that a user (or an employer) can communicate with it
using a specifically designed client program, and it supports six
different contexts through which the client can send messages. These
contexts are Home, Register, Login, Delete, Update, and Match (see
details in Section 3.1.1).


[Figure 1 shows an example record with the fields Username (Bob
Manning), Password (masked), Description ("Bob Manning is a hard
worker looking for a hard job. He can be contacted at
bob@YouNameIt.com."), Is an Employer, and a Profile (Interest and
Aptitude) Vector with an Interests row and an Aptitude row of eight
values each.]

Figure 1: A graphical representation of a system record
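The record layout described above can be sketched as a small data
class. The class name, the field names, and the convention that the
first half of the profile vector holds interests and the second half
holds aptitudes are all assumptions made for illustration; the actual
storage format is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """Hypothetical in-memory form of one server-side record."""
    username: str
    password: str                  # kept as given in this sketch only
    description: str = ""
    is_employer: bool = False
    profile: List[float] = field(default_factory=list)
    complete: bool = False         # True once account creation is finished

    def interests(self) -> List[float]:
        # Assumption: the first half of the vector holds the interest values.
        return self.profile[: len(self.profile) // 2]

    def aptitudes(self) -> List[float]:
        # Assumption: the second half of the vector holds the aptitude values.
        return self.profile[len(self.profile) // 2:]


example = Record(username="Bob Manning", password="secret",
                 description="A hard worker looking for a hard job.",
                 profile=[0.6, 1.5, 8.1, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Keeping interests and aptitudes in one vector mirrors the single
profile vector of the record, while the two accessors make the split
explicit for the matching step.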


3.1.1. The Server’s Contexts. Sending a message to Home is used to
establish a secure connection with the server. (See Figure 2 for the
layout of the client-server architecture of the proposed system.)
Messages to all other contexts are encrypted as they, at the very
least, include the user’s password, and often contain other
potentially sensitive information as well.

The Register command is used to create an account by sending to the
server the username and password for the account that the user wishes
to create. As long as the username is not associated with any
existing record, a default record is created containing that
username and password. By default, the text description is an empty
string, the user is not an employer, the vector is set to random
values, and the account is not complete.

The Login command retrieves the record corresponding to the username
that is sent to the server, and the record is forwarded to the client
if the password matches the one in the record. All the remaining
commands are only carried out if the password matches the one in the
record for the given username.

Delete causes the server to remove the record corresponding to the
username that is passed to it. Update replaces the record for the
user with one corresponding to a serialized record sent with the
request, so that all values in the record other than the username and
password are set.

Finally, Match returns the usernames and descriptions of the user’s
top-ranked matches. Before the user’s Match request can be satisfied,
their account must be marked as completed, and only completed
accounts are considered when running the matching algorithm.
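The behaviour of these contexts can be sketched with an in-memory
store. The sketch leaves out the network layer, the encryption, and
the matching itself; the function names, the dictionary layout, and
the default vector length of 16 are hypothetical and are not taken
from the actual server.

```python
import random

records = {}   # username -> record, held in memory for this sketch


def register(username, password, vector_len=16):
    """Create a default record unless the username is already taken."""
    if username in records:
        return False
    records[username] = {
        "password": password,
        "description": "",
        "is_employer": False,
        "vector": [random.random() for _ in range(vector_len)],
        "complete": False,
    }
    return True


def _authorized(username, password):
    rec = records.get(username)
    return rec is not None and rec["password"] == password


def login(username, password):
    """Return the record if the password matches, otherwise None."""
    return records[username] if _authorized(username, password) else None


def delete(username, password):
    """Remove the record of an authorized user."""
    if not _authorized(username, password):
        return False
    del records[username]
    return True


def update(username, password, new_values):
    """Overwrite every field except the username and password."""
    if not _authorized(username, password):
        return False
    for key in ("description", "is_employer", "vector", "complete"):
        records[username][key] = new_values[key]
    return True


def can_match(username, password):
    """Match requests are only served for completed accounts."""
    return _authorized(username, password) and records[username]["complete"]
```

Every context other than Register checks the password first, which
mirrors the rule that the remaining commands are only carried out for
an authenticated user.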


3.1.2. Overview of the Client. The client can be run from a JavaFX
application and communicates with the server on behalf of the 
user. This application provides a user interface with buttons and 
text fields so the user may provide the client the information it 
needs to send its requests in a format that the user understands. 
The application also maintains a working model of the user’s in- 
tent based on user input. To guide the user through creating and 
using their profile, the application is divided into several pages, i.e., 
XML documents, that are navigated through by pressing buttons 
in the user interface (see the configuration in Figure 3). Each page 
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Figure 2: The user’s computer (on the left) communicates 
with the server (on the right) over the Internet through the 
client. Messages (in italics) over the Internet are encrypted. 


[Figure 3 diagram. Recoverable labels: Begin, Home, Login/Register,
Modify Profile, Edit Profile, Delete Profile, Yes, Leave, Refresh,
Find matches, and the Delete, Update, and Match requests.]


Figure 3: The structure of the client app. Rectangles are pages,
i.e., XML documents, ovals are requests to the server (one for each
context), and arrows are buttons.


defines what the user sees, including text, buttons, and text fields. 
Some button presses also cause the client to send a request to the 
server, in which case the application must wait for the server re- 
sponse before it changes pages. 


3.2 Novel Extensions to the Gale-Shapley 
Algorithm 


Our implementation of the Gale-Shapley algorithm, which was designed
for matching students with universities for college admissions and
matching couples for marriage, is augmented with two novel
extensions: (i) a routine for recursively applying the algorithm in
order to generate a ranking for a user instead of just a single
match, and (ii) a filter so that the algorithm can both run faster
and run even when the number of users and employers is not equal. As
there may be some inaccuracies in the stable matching due to an
inability to perfectly capture all information about a user or
employer that may be of interest to the opposite party, a ranking of
matches is given rather than just a single match so that users may
choose for themselves among the matches. Our method of augmenting the
Gale-Shapley algorithm so that it may be used for ranking is unique
to our work. This ranking method requires a filtering system, which
is also needed to ensure that the Gale-Shapley algorithm can be
applied even with a dynamic set of users and employers. This relaxes
another restriction of the Gale-Shapley algorithm, which requires a
static set of users.


3.3. Stable Matchings 


While this is not the first matching algorithm to be applied to
helping adults with autism find employment [3, 16], it is the first
where all the implementation details are publicly available, and the
matching algorithm used is novel. It is based on the Gale-Shapley
algorithm, but it is augmented with original features.

A matching algorithm can fulfill different criteria, with the
Gale-Shapley algorithm finding the single stable matching that is
optimal for one of the two parties that it is matching [6]. A
matching is defined as a one-to-one mapping between two parties, with
the pairing between a user and an employer called a match in our
case. Furthermore, a matching is stable if no user and employer would
both rank each other higher than the partners they are paired with
in the matching. (Figure 4 gives examples of an unstable and a stable
matching for the same data set.) The reason we choose to find this
stable matching, with filtering, rather than fulfill a different
criterion is that our matches are non-binding, with either party
being free to accept or reject the match, since the user still needs
to apply for the job afterwards and it is still up to the employer’s
discretion to accept the application. If a stable matching is
accurate, then both the user and the employer should have no reason
not to accept the match, since they would not be able to find a
partner better than the one they were matched with who would also
reciprocate their choice. If the recommended employer for a user is
not a pairing from a stable matching, it may be to the advantage of
the user or employer to ignore their match, defeating the purpose of
suggesting that match.
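The stability condition can be checked directly by searching for
blocking pairs, i.e., a user and an employer who both rank each other
above their assigned partners. The sketch below assumes complete
preference lists and a one-to-one matching. The preference lists are
hypothetical; they are only chosen so that Carol and Fred block the
first matching, as in the example of Figure 4.

```python
def blocking_pairs(matching, user_prefs, employer_prefs):
    """List (user, employer) pairs that prefer each other to their partners."""
    partner_of = {e: u for u, e in matching.items()}
    pairs = []
    for u, prefs in user_prefs.items():
        for e in prefs:
            if e == matching[u]:
                break                  # remaining employers rank below u's match
            e_prefs = employer_prefs[e]
            if e_prefs.index(u) < e_prefs.index(partner_of[e]):
                pairs.append((u, e))   # e also prefers u to its current partner
    return pairs


# Hypothetical preference lists, best choice first.
user_prefs = {
    "Alice": ["David", "Eve", "Fred"],
    "Bob": ["Eve", "Fred", "David"],
    "Carol": ["Fred", "Eve", "David"],
}
employer_prefs = {
    "David": ["Alice", "Carol", "Bob"],
    "Eve": ["Carol", "Alice", "Bob"],
    "Fred": ["Carol", "Alice", "Bob"],
}
unstable = {"Alice": "David", "Bob": "Fred", "Carol": "Eve"}
stable = {"Alice": "David", "Bob": "Eve", "Carol": "Fred"}
```

A matching is stable exactly when this function returns an empty list.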

For a given dataset, multiple stable matchings may exist, and it is
possible to find the optimal stable matching according to arbitrary
objectives [19], but we choose to use just the one found by the
Gale-Shapley algorithm. The linear programming algorithm necessary to
find other stable matchings is both harder to implement and slower
than the Gale-Shapley algorithm, so there must be a compelling reason
to optimize a different objective in order to justify using the more
complex algorithm. We consider finding the optimal stable matching
from the perspective of the users to be a good objective for the sake
of benefiting the autistic community, and using the Gale-Shapley
algorithm is sufficient to reach it.


3.4 Creating Profiles for Matching 


Before users can be matched with employers, individuals in both
parties need to create profiles for themselves within the system.
When making a profile, someone first specifies whether they are
looking for or offering employment, which determines if they are a
user or an employer, respectively. In either case, the user or
employer is walked through more questions to continue building their
profile. Users are asked questions to determine both what jobs
interest them and what skills they have. (See Appendix A for the
sample set of questions for users and employers created for the
job-matching system.) Based on their responses to the questions, a
numerical record is generated with different fields corresponding to
different tasks. This numerical record has two parts: an aptitude
portion corresponding to skills and an interest portion
corresponding to interests. Employers, on the other hand, are asked
questions about the job they are offering to determine the qualities
required to obtain and excel at the job, and what qualities the job
has that may be of interest to a user. A numerical record is
generated for them as well, whose fields directly correspond to those
in a user’s record. For requirements, the larger the number in the
employer’s field, the more of the corresponding skill is demanded
from the user. This defines the employer’s aptitude vector.
Similarly, the same number in the user’s corresponding field means
that the job matches the user’s interest in that respect, which
defines the employer’s interest vector. The system for guiding users
through creating their profiles and storing the corresponding
records is detailed in Section 3.1. When employers and users are
matched with each other, users will evaluate employers with similar
profiles, and vice versa.


[Figure 4 shows two panels, (a) an unstable matching and (b) a stable
matching, each with the first, second, and third choices of the users
Alice, Bob, and Carol and of the employers David, Eve, and Fred.]

Figure 4: The red bold lines in 4(a) show an example of an unstable
matching for a given set of rankings. It is unstable, since Carol and
Fred would rather match with each other (dashed line) than with their
given matches (Eve and Bob, respectively), whereas 4(b) is a stable
matching for the same data set.

3.4.1 Searching for Matches. After a user has created their profile,
they have the option to look for jobs by pressing a button. This
triggers the matching algorithm, which returns a ranked list of jobs
to be displayed to the user. The list includes the name of the
employer offering each job and information about the job. The user
can always press the button again to refresh the list in case
activity from other users and employers has changed the results;
doing so simply reruns the matching algorithm on whatever user and
employer profiles are currently in the system. There is also a second
button to reject all the jobs in the list. Choosing the latter adds
flags to the user’s record specifying that the user is not interested
in those jobs, so that they are excluded from further matches, and
also updates the interest components of the user’s profile to be
further away from the rejected profiles. The details of how this is
done are described in Section 3.6. When running the matching
algorithm, users rank employers with similar interest vectors higher,
whereas employers rank users with similar aptitude vectors higher. As
used by our job-matching algorithm, the interest vectors of the
employers measure how much the users are satisfied with the
employers’ requirements, and the aptitude vectors of the users
measure how much the employers are satisfied with the users’
qualifications.


3.4.2 Filtering Profiles. Our matching algorithm is based on the
Gale-Shapley algorithm, but includes some additional steps to ensure
that the prerequisites for using the Gale-Shapley algorithm are
satisfied, and so that it can be used to generate a ranking rather
than just a single match. First, stable matchings are only defined
when the two parties being matched have the same number of
participants, so the Gale-Shapley algorithm itself requires the same
number of user and employer profiles. To ensure this, we apply a
filtering scheme that guarantees an equal number of users and
employers, so that only the user and employer profiles predicted to
be most likely to affect whom the target user is matched with are
considered for matching. Specifically, the filtered employers consist
of the jobs the target user is most likely to apply for, based on
being closest to what the user is most interested in, and the jobs
the target user is most likely to succeed at, based on the user’s
aptitude meeting the employer’s requirements. (Table 1 shows a number
of sample questions used for creating the aptitude vectors.)
Meanwhile, the filtered users are the users the target user is most
likely to compete with when going after jobs they are interested in,
based on having similar interests. (Table 2 includes a number of
sample questions aimed at finding the interests of a potential
employer and user.) Our filtering scheme requests both the n (≥ 1)
employer profiles representing the jobs the target user is estimated
to be most likely to succeed at, and the m (≥ 1) employer profiles
that the target user is predicted to be most interested in and that
are not already being considered as part of the previous request.
Multiple studies have confirmed that, with randomly generated
rankings, the expected ranking of the final match is
log(Number of Profiles) [18]. For this reason, we advise setting n
and m to be around the logarithm of the estimated maximum number of
users. In our implementation, we set n = 100 and m = 50, arbitrary
values from a range estimated to be high enough to obtain accurate
results and small enough to run in a reasonable amount of time.

To calculate the n jobs the user is most likely to succeed at, we
first calculate the discrepancy between a user and an employer as
the sum of squared differences over all the fields (represented
as components of a vector) in the employer’s profile where the
employer’s requirement exceeds the user’s skill, as denoted by the
corresponding field in the user’s profile and as shown in Equation 1.

$$\mathrm{Div}_{Apt}(U, E) = \sum_{i} \mathrm{ReLU}(E_{A_i} - U_{A_i})^2 \qquad (1)$$

where ReLU(X) = X if X > 0, otherwise ReLU(X) = 0, $E_{A_i}$ is the
i-th component of the Employer’s aptitude vector, and $U_{A_i}$ is
the i-th component of the User’s aptitude vector.


EXAMPLE 1. Assume that under a particular scheme, a user’s aptitude
vector is <1.2, 2.3, 4.9>, which represents their interview skills,
noise tolerance, and loyalty, while an employer’s aptitude vector is
<3, 2.5, 2>, meaning the employer puts significant emphasis on
presentation during interviews, the environment is moderately noisy,
and the employer weakly desires company loyalty. The user’s fit for
the job would be interpreted as follows: the user is sufficiently
loyal and would need at most minor accommodations to handle the
noise in the environment, but would need to work on their interview
skills significantly in order to be likely to succeed at applying
for the job. This corresponds with component-wise discrepancies of
3.24, 0.04, and 0, which sum to 3.28 for the total discrepancy. □
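
Equation 1 can be sketched in a few lines of Python. The function names `relu` and `aptitude_discrepancy` are illustrative assumptions, not names taken from the system's implementation; the vectors are those of Example 1.

```python
def relu(x: float) -> float:
    """Return x when it is positive, otherwise 0."""
    return x if x > 0 else 0.0


def aptitude_discrepancy(user_aptitude, employer_aptitude):
    """Equation 1: sum of squared shortfalls, counting only the fields
    where the employer's requirement exceeds the user's skill."""
    if len(user_aptitude) != len(employer_aptitude):
        raise ValueError("aptitude vectors must have the same length")
    return sum(relu(e - u) ** 2
               for u, e in zip(user_aptitude, employer_aptitude))


# Example 1: interview skills, noise tolerance, loyalty.
user = [1.2, 2.3, 4.9]
employer = [3.0, 2.5, 2.0]
print(round(aptitude_discrepancy(user, employer), 2))  # prints 3.28
```

Because of the ReLU, surplus skill (here, loyalty) never offsets a shortfall in another field.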


As shown in Example 1, the discrepancy represents extra labor 
the user must expend to reach the demands of the job, or additional 
accommodations the employer must make to be accessible to the 
user, since a higher discrepancy means a user is a worse fit for the 
job. We can then take those n employers with minimal discrepancy 
between them and the target user by ranking the profiles in order 
of increasing discrepancy and keeping only the top n in the ranking. 
In the case of a tie where two employers have the same discrepancy 
from the target user, the first of the two employer profiles to be cre- 
ated is given priority in the ranking, assisting employers who have 
been waiting longer to be reached in the system. The top m pro- 
files that the target user is predicted to be the most interested in 
are calculated in a similar way, but with a different formula for dis- 
crepancy. This is done by comparing the fields (again represented 
as components of a vector) related to interest instead of those re- 
lating to skill, and including all fields in the sum, not just those 
where the employer’s value is greater than the user’s as shown in 
Equation 2. 


$$\mathrm{Div}_{Int}(U, E) = \sum_{i} (E_{I_i} - U_{I_i})^2 \qquad (2)$$

where $E_{I_i}$ is the i-th component of the Employer’s interest
vector, and $U_{I_i}$ is the i-th component of the User’s interest
vector.

Equation 2 is the squared Euclidean distance between the interest
fields of the profiles, modeled as vectors in a normed space, and
therefore produces the same ranking as the Euclidean distance. This
discrepancy represents the divergence between a user’s ideal job and
the given job. A potential scheme for interest includes consistency
of tasks and the social culture of the workplace, with higher values
denoting more consistency in tasks and a more prominent social
culture.


EXAMPLE 2. Assume that the interest vector of an adult with
autism is <2.4, 0.9> under our encoding scheme, meaning they
would like strong consistency in their tasks and would prefer not
to interact with others while working. Further assume that the
interest vector of an employer is <3.1, 2.2>, suggesting that the
work is quite repetitive and that there is a moderate social culture
in the workplace. The component-wise discrepancies are 0.49 and
1.69, which sum to 2.18. □
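
The interest discrepancy can be sketched the same way. The name `interest_discrepancy` is an assumption made for illustration; following the prose description, every field contributes to the sum regardless of sign. The vectors are those of Example 2.

```python
def interest_discrepancy(user_interest, employer_interest):
    """Squared distance between two interest vectors: every field
    contributes, whichever of the two values is larger."""
    if len(user_interest) != len(employer_interest):
        raise ValueError("interest vectors must have the same length")
    return sum((e - u) ** 2
               for u, e in zip(user_interest, employer_interest))


# Example 2: task consistency, social culture.
user = [2.4, 0.9]
employer = [3.1, 2.2]
print(round(interest_discrepancy(user, employer), 2))  # prints 2.18
```

Unlike the aptitude discrepancy, this measure is symmetric: swapping the two vectors gives the same value.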


If there are fewer than n+m (= 150 in our implementation) employer
profiles in the system, then all of them will be considered, and
the discrepancy calculations used to obtain the top n and m
employer profiles can be skipped. As a result, either n+m employer
profiles will be considered after filtering is applied, or no
filtering will be applied and all E employer profiles will be
considered, where E is the total number of employer profiles in the
system.

$$N = \min(n + m, E) \qquad (3)$$


where N is the number of filtered employer profiles and is also the
number of filtered user profiles to be considered, which ensures
that the number of users being considered is equal to the number of
employers.

Next, the N-1 user profiles with the most similar interests to the
target user are considered so that, together with the target user,
N user profiles will be considered, ensuring that the same number
of user and employer profiles are considered after filtering. An
example of the result of filtering with n = 2 and m = 1 is shown in
Figure 5.

IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada

Joseph Bills and Yiu-Kai Ng

Table 1: Sample questions used for creating the aptitude vectors

| Objectives | User’s Question | User’s Response | User’s Score | Employer’s Question | Employer’s Response | Employer’s Score |
| Interview Skills | How are your interview skills? | Poor | 1.2 | How important is presentation during interview? | Very | 3.0 |
| Noise Tolerance | How well can you function in a noisy environment? | Manageably | 2.3 | How noisy is your workplace? | A little bit | 2.5 |
| Loyalty | How loyal are you? | Extremely | 4.9 | How important is company loyalty? | Slightly | 2.0 |

Table 2: Sample questions used for creating the interest vectors

| Objectives | User’s Question | User’s Response | User’s Score | Employer’s Question | Employer’s Response | Employer’s Score |
| Task Consistency | How much do you like task consistency? | I like strong consistency | 2.4 | How consistent are the job tasks? | They are quite repetitive | 3.1 |
| Social Culture | What social culture do you desire in a workplace? | I prefer not to interact w/ others while working | 0.9 | What is the social culture in your job like? | There is a moderate social culture | 2.2 |

Figure 5: Users (blue) and employers (orange) being considered
for matching are enclosed in rectangles. Each column is a sorted
list, with the arrow on the left showing whether the list is sorted
by interest discrepancy (blue/solid) or aptitude discrepancy
(orange/dashed). It is assumed that Alice is the target user, so all
discrepancy is measured relative to her. Note that n = 2 in this
case, so Fred and David are enclosed in the orange (dashed)
rectangle, and m = 1, so Eve is enclosed in the blue (solid)
rectangle after David and Fred are passed over due to already being
considered. Alice herself is in the grey (dot and dashed) rectangle,
and the n + m - 1 users closest to Alice are in the green (dotted)
rectangle. Together, three users and three employers are being
considered, so a matching is defined.


Users with similar interests are considered because they are the
most likely to compete with the target user for their preferred job
and thus affect the results of the Gale-Shapley algorithm. The
corresponding discrepancy in interest is calculated in the same way
as for the m employers the user is most likely to be interested in,
by calculating the Euclidean distance between their interest records
and keeping those with the lowest scores. To ensure that N-1 users
can always be filtered, n+m-1 mock user profiles are included within
the system in addition to real user profiles. These mock profiles do
not correspond with any individuals and exist only to ensure the
algorithm can be applied. They are randomly generated, but form a
distribution that matches that of real user profiles, including
potential profiles that would correspond with non-autistic
individuals in order to simulate wider competition. While these mock
profiles may influence the results of the algorithm as they simulate
competition, they will never be target users, and will never compete
with actual users when users apply for jobs after being matched.
Because the filtering never returns more than the requested number
of profiles, the matching that follows runs in constant time
relative to the number of profiles in the system, improving over the
quadratic time of the unconstrained Gale-Shapley algorithm, though
the computational time for filtering grows linearly with the number
of profiles. As a result, the overall time complexity for matching a
single user is linear.
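
The filtering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: the `Profile` class, its `created` field (used for the tie-break in favour of older profiles), and the function names are all hypothetical, and mock-profile padding is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Profile:
    name: str
    aptitude: List[float]
    interest: List[float]
    created: int  # creation order; a smaller value means an older profile


def aptitude_gap(user: Profile, employer: Profile) -> float:
    """Equation 1: squared shortfall where requirements exceed skills."""
    return sum(max(e - u, 0.0) ** 2
               for u, e in zip(user.aptitude, employer.aptitude))


def interest_gap(a: Profile, b: Profile) -> float:
    """Equation 2: squared distance between two interest vectors."""
    return sum((x - y) ** 2 for x, y in zip(a.interest, b.interest))


def filter_profiles(target: Profile, employers: List[Profile],
                    others: List[Profile], n: int, m: int
                    ) -> Tuple[List[Profile], List[Profile]]:
    """Return N employers and N users (target first), N = min(n + m, E)."""
    if not employers:
        raise ValueError("at least one employer profile is required")
    if len(employers) <= n + m:
        chosen = list(employers)  # too few employers to need filtering
    else:
        # The n best fits by aptitude; ties go to the older profile.
        chosen = sorted(employers,
                        key=lambda e: (aptitude_gap(target, e), e.created))[:n]
        taken = {id(e) for e in chosen}
        rest = [e for e in employers if id(e) not in taken]
        # The m closest by interest among those not already chosen.
        chosen += sorted(rest,
                         key=lambda e: (interest_gap(target, e), e.created))[:m]
    size = len(chosen)
    # The N - 1 users whose interests are closest to the target's.
    rivals = sorted(others, key=lambda u: interest_gap(target, u))[:size - 1]
    if len(rivals) < size - 1:
        raise ValueError("too few user profiles; the paper pads with mock profiles")
    return chosen, [target] + rivals
```

Sorting is used here for brevity; it costs more than the linear-time selection the complexity argument above relies on, which could be obtained with a bounded heap such as `heapq.nsmallest`.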


3.5 The Modified Gale-Shapley Algorithm 


The next prerequisite of the modified Gale-Shapley algorithm is
information about how each user being considered ranks each
employer profile being considered, and vice versa. For this, users
rank employers by how interested they are in them, whereas
employers rank users by how likely they are estimated to succeed.
These ranks are calculated in the same way as during the filtering
process, by sorting on the calculated discrepancy so that those with
the lowest discrepancy are most preferred. With the rankings
generated, the modified Gale-Shapley algorithm is applied to find a
stable matching.

When the algorithm starts, no user is considered matched; when it
stops, every user is considered matched. The modified Gale-Shapley
algorithm consists of applying the following loop, called the
Gale-Shapley Loop, which matches and un-matches users until every
user is considered to be matched with an employer. At that point,
these matches become the official matches, which are returned.

1. Every user who is not considered to be matched is considered
as a potential match for their current top-ranked employer.

2. Each employer becomes matched with the user it ranks highest
among those being considered as potential matches for it. The
other users who were being considered for that employer are no
longer considered to be matched, and now take their next
top-ranked employer as their current top-ranked employer.


EXAMPLE 3. Figure 6 shows a complete run of the modified
Gale-Shapley algorithm on a simple dataset. Step 1 is the
initialization, and every following step alternates between Step 1
and Step 2 of the Gale-Shapley loop. Potential matches are blue
(asterisk/dotted), matches are green (plain/solid), and rejected
matches are red (x-ed/dashed). This example does not show any cases
of former matches becoming rejected (going from green to red), but
such behavior is possible. □
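
The loop above can be sketched in Python. The sketch below is the common sequential formulation of user-proposing Gale-Shapley (one unmatched user at a time), not the round-based presentation of Steps 1 and 2; both yield the same user-optimal stable matching. The function name and the sample rankings are assumptions for illustration and are not the data of Figure 6.

```python
def gale_shapley(user_prefs, employer_prefs):
    """User-proposing Gale-Shapley.

    user_prefs:     dict user -> list of employers, most preferred first
    employer_prefs: dict employer -> list of users, most preferred first
    Both sides must have the same size and rank everyone on the other side.
    Returns a dict user -> matched employer.
    """
    # rank[e][u] is the position of user u in employer e's ranking.
    rank = {e: {u: i for i, u in enumerate(prefs)}
            for e, prefs in employer_prefs.items()}
    next_choice = {u: 0 for u in user_prefs}  # index of next employer to try
    held = {}                                 # employer -> tentatively matched user
    unmatched = list(user_prefs)
    while unmatched:
        user = unmatched.pop()
        employer = user_prefs[user][next_choice[user]]
        next_choice[user] += 1
        current = held.get(employer)
        if current is None:
            held[employer] = user
        elif rank[employer][user] < rank[employer][current]:
            held[employer] = user      # employer prefers the newcomer
            unmatched.append(current)  # a former match becomes rejected
        else:
            unmatched.append(user)     # rejected; will try the next employer
    return {user: employer for employer, user in held.items()}


users = {
    "Alice": ["David", "Eve", "Fred"],
    "Bob": ["David", "Fred", "Eve"],
    "Carol": ["Fred", "David", "Eve"],
}
employers = {
    "David": ["Bob", "Alice", "Carol"],
    "Eve": ["Alice", "Bob", "Carol"],
    "Fred": ["Carol", "Bob", "Alice"],
}
print(gale_shapley(users, employers))
```

In this sample, Alice and Bob both want David; David prefers Bob, so Alice falls back to Eve, her second choice.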


3.6 Generating Ranked Results 


The employer the target user is matched with is returned as their
top suggested employer. To generate the rest of the ranking, assume
that the user was not interested in the most recent employer
suggested to them. To account for this scenario, the components of
the vector representation that measure the user’s interest are
updated to be further away from the corresponding components in
that employer’s record. This is done by subtracting, for each
component representing interest, a weighted¹ multiple of the
difference between the employer’s component and the user’s
component from the user’s component, as shown in Equation 4.


$$U_I' = U_I - w \times (E_I - U_I) \qquad (4)$$

where $U_I$ is the target user’s interest vector, $E_I$ is the
matched employer’s interest vector, w is the weight given to
negative feedback, and $U_I'$ is the target user’s updated interest
vector.


EXAMPLE 4. Consider the interest vector of an adult with autism
as shown in Example 2, which is <2.4, 0.9>, and the interest vector
of the matched employer, which is <3.1, 2.2>. Further assume that
the weight w is 0.5. Hence, the difference between the interest
vectors of the two profiles is <0.7, 1.3>, and the updated user
interest vector in the profile is <2.4, 0.9> - 0.5 × <0.7, 1.3> =
<2.05, 0.25>. A visual representation of this change is shown in
Figure 7(a). □
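
Equation 4 is a one-line update. The sketch below uses the hypothetical name `push_away` and reproduces the numbers of Example 4.

```python
def push_away(user_interest, employer_interest, weight):
    """Equation 4: move the user's interest vector away from a rejected
    employer's interest vector by `weight` times their difference."""
    return [u - weight * (e - u)
            for u, e in zip(user_interest, employer_interest)]


updated = push_away([2.4, 0.9], [3.1, 2.2], 0.5)
print([round(x, 2) for x in updated])  # prints [2.05, 0.25]
```

A weight of 0 leaves the vector unchanged, and larger weights push it further from the rejected employer.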


The matched employer will also not be considered further when
filtering employers. With this in mind, the rest of the matching
algorithm can be re-applied with the updated information, i.e., the
updated interest fields and ignoring the last recommended profile,
in which case it will generate a new matching and return a new
match. (See Figure 7(b) for a new set of candidates to be considered
for matching.) This new match can be returned as the next suggested
employer. The process can in theory be repeated until every employer
has been ranked, but it only needs to be applied until the specified
number of employers to be displayed to the user have been ranked, at
which point they are displayed to the user. This same process is
used to update a user’s profile if the user rejects all the matches,
so that if k results are listed at a time, rejecting the results
causes ranked results k+1 through 2k according to the original
profile vector to be displayed instead, rejecting again displays
results 2k+1 through 3k, and so on. This process of iteratively
applying the modified Gale-Shapley algorithm on filtered results to
create a ranking is novel. The optimal value of k depends on what is
practical to display on the screen while still retaining ease of
use; we have chosen k = 5. Shown below is the modified Gale-Shapley
algorithm.

¹ The value of the weight is determined empirically with the goal of
having users choose higher-ranked matches. This can be done by
finding an approximately optimal solution that minimizes the
aggregated rank of users’ choices based on experimental data.


Algorithm. Modified Gale-Shapley Matching Algorithm (User,
Employers, Users)
Begin
1. Let Target User be User, Potential Employers be Employers,
   and Potential Users be Users excluding the Target User.
2. For (i = 0; i < k; i++) do

   (a) Apply the filter on Potential Employers, Potential Users, and
       Target User to obtain N Filtered Employers and N Filtered
       Users.

   (b) Call the Gale-Shapley algorithm (Filtered Employers, Filtered
       Users) [5] to obtain the Stable-Matching map.

   (c) Select the match for the Target User from the Stable-
       Matching map as the Matched Employer.

   (d) Let the i-th ranked match be the Matched Employer.

   (e) Update the interest vector of the Target User to
       TargetUser_I - w × (Match_I - TargetUser_I), where X_I
       denotes the interest vector of the user or employer X, and
       w is the weight.

   (f) Let Potential Employers be the Potential Employers without
       the Matched Employer.

3. Return the ranked matches.
End
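
The listing above can be sketched end to end in Python. Everything here is an illustrative assumption rather than the authors' Java code: profiles are plain dictionaries with `"apt"` and `"int"` keys, all function names are hypothetical, tie-breaking by profile age and mock-profile padding are omitted, and the matching step is the sequential form of user-proposing Gale-Shapley.

```python
def sq_dist(a, b):
    """Equation 2: squared distance between two interest vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shortfall(user_apt, emp_apt):
    """Equation 1: squared shortfall of skills against requirements."""
    return sum(max(e - u, 0.0) ** 2 for u, e in zip(user_apt, emp_apt))


def stable_match(user_prefs, emp_prefs):
    """User-proposing Gale-Shapley; returns a dict user -> employer."""
    rank = {e: {u: i for i, u in enumerate(p)} for e, p in emp_prefs.items()}
    nxt = {u: 0 for u in user_prefs}
    held, free = {}, list(user_prefs)
    while free:
        u = free.pop()
        e = user_prefs[u][nxt[u]]
        nxt[u] += 1
        cur = held.get(e)
        if cur is None or rank[e][u] < rank[e][cur]:
            held[e] = u
            if cur is not None:
                free.append(cur)
        else:
            free.append(u)
    return {u: e for e, u in held.items()}


def rank_employers(target, users, employers, k, n, m, w):
    """Return up to k employer names, ranked for the user named `target`."""
    interest = list(users[target]["int"])  # working copy, updated each round
    pool = dict(employers)                 # employers still eligible
    ranked = []
    for _ in range(min(k, len(pool))):
        # (a) Filter employers, then pick the closest rival users.
        if len(pool) <= n + m:
            emps = list(pool)
        else:
            emps = sorted(pool, key=lambda e: shortfall(
                users[target]["apt"], pool[e]["apt"]))[:n]
            rest = [e for e in pool if e not in emps]
            emps += sorted(rest, key=lambda e: sq_dist(
                interest, pool[e]["int"]))[:m]
        rivals = sorted((u for u in users if u != target),
                        key=lambda u: sq_dist(interest, users[u]["int"]))
        rivals = rivals[:len(emps) - 1]
        if len(rivals) < len(emps) - 1:
            raise ValueError("too few users; the paper pads with mock profiles")
        group = {target: {"apt": users[target]["apt"], "int": interest}}
        group.update({u: users[u] for u in rivals})
        # Rankings: users sort by interest, employers by aptitude shortfall.
        user_prefs = {u: sorted(emps, key=lambda e, p=p: sq_dist(
            p["int"], pool[e]["int"])) for u, p in group.items()}
        emp_prefs = {e: sorted(group, key=lambda u, e=e: shortfall(
            group[u]["apt"], pool[e]["apt"])) for e in emps}
        # (b), (c) Run the matching and read off the target's match.
        match = stable_match(user_prefs, emp_prefs)[target]
        ranked.append(match)                                  # (d)
        interest = [x - w * (y - x)                           # (e)
                    for x, y in zip(interest, pool[match]["int"])]
        del pool[match]                                       # (f)
    return ranked
```

The stored profile in `users` is never modified; only the working copy of the target's interest vector moves between rounds, which mirrors how the ranking is generated without the user having rejected anything yet.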


4 SYSTEM SIMULATION AND RESULTS 


Ideally, our solution would be tested with end users to see whether
the designed system is superior to other solutions in practice.
Instead, we show that some aspects of our solution are superior to
other solutions, at least under certain conditions. Specifically, we
have tested our matching algorithm in a simulated environment,
showing that it is superior to comparable algorithms for the case
that was simulated. This simulation was coded in Java and run in the
IntelliJ development environment.


4.1 The Simulation 


For the simulation, we created a working system where users and
employers continuously enter the system and leave when they are
hired or hire someone, respectively. Users in the system request
matches and apply to one of the matches suggested to them. After a
user applies to the job that an employer is advertising, the
employer decides whether to hold or reject the applicant, and may
eventually hire an applicant being held. If a user is hired, the
user is removed along with the employer, and the count of successful
hires is incremented. The simulation ceases after a set number of
steps, which was 100 in our simulation, and the number of successful
matches is returned. This count is used to compare different
matching schemes, which are treated as individual algorithms in this
simulation.

Figure 6: An example of running the modified Gale-Shapley algorithm

During the simulation process, we considered four different
matching schemes, denoted “Matched”, “Interest”, “Aptitude”, and
“Mixed”. Matched refers to ranking users using our own matching
algorithm, whereas Interest and Aptitude rank users by interest
and aptitude divergence, respectively, and Mixed refers to ranking
users by the sum of aptitude and interest divergence when they are
being recommended to a target user. As these different schemes are
based on making matches from the same distribution of vector
spaces, they are comparable, and we know that differences between
their results must be due to the matching algorithm itself rather
than to the data they utilized. Our matching scheme differs both
from the comparable schemes and from the other existing matching
schemes mentioned in Section 2 in that the former considers the
competition along the two aspects to find stable matchings, whereas
the latter only try to optimize the discrepancy between employer
and employee. For each matching scheme, we ran the simulation 100
times and calculated the average number of successful matches, as
well as the variance of successful matches in the sample, so that
null hypothesis testing could be performed. Based on these
statistics, the null hypothesis that our matching scheme performs
equally well as or worse than comparable matching algorithms in
terms of average successful matches under the parameters of the
simulation is rejected, indicating that our method is superior in
the case that was simulated.

Every time the step event is executed, new users or employers are
generated according to the same geometric distribution to simulate
users continuously entering the system. This geometric distribution
is defined by its stopping probability, which is 0.3 in our
simulation². Each time a user or an employer is generated, an event
for making their profile is queued.

Once a user has created a profile, the user queues an event to
request matches. Out of the list of given matches, the user applies
for the one they like the most in terms of interest, based on the
true vector’s values. The user then waits until receiving a
rejection or a hiring confirmation. If rejected, the user requests
matches again by queuing another match event, and the application
process repeats. A constraint is that a user will not apply to the
same job twice. If a user has applied to every job, the user waits
before refreshing the job list by queuing another match event to be
executed during the next step.

² 0.3 is the probability that generation will stop after each user
or employer is generated and the step event ends.
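
The arrival process described at the start of this passage can be sketched as follows. The reading assumed here is that each step generates one profile, then stops with the given probability, and otherwise generates another; the function name and the use of a seeded `random.Random` are illustrative assumptions, and the choice between generating a user or an employer is not modeled.

```python
import random


def arrivals_this_step(stop_probability: float, rng: random.Random) -> int:
    """Number of profiles generated in one step: generate one, stop with
    probability `stop_probability`, otherwise generate another."""
    if not 0.0 < stop_probability <= 1.0:
        raise ValueError("stop_probability must be in (0, 1]")
    count = 0
    while True:
        count += 1
        if rng.random() < stop_probability:
            return count


rng = random.Random(2021)  # fixed seed so that runs are repeatable
counts = [arrivals_this_step(0.3, rng) for _ in range(100)]
print(sum(counts) / len(counts))  # sample mean of arrivals per step
```

Under this reading the expected number of arrivals per step is 1 / 0.3, or roughly 3.3.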

In addition to the aptitude vector, an employer has a threshold for
each component of the vector, and these thresholds denote the
requirements for a job. After creating a profile, the employer
queues a wait event and remains idle until there are users in the
application queue. If an employer has applications in the queue, the
employer tests the first user in the queue against the thresholds
and removes that user from the application queue. If the tested user
does not meet the threshold in some component, meaning the threshold
for that component is greater than the value in the user’s true
aptitude vector, the user is rejected. If the employer rejects a
user while not holding any users, the employer lowers the threshold
to a value between the previous threshold and the rejected user’s
component according to a linear function, simulating the employer
becoming more willing to accommodate as they are unable to find
someone who can fulfill the requirements. In our simulation, the
adjustment weight was set at 0.5, meaning that the new threshold is
set halfway between the old threshold and the user’s component. If
the tested user meets all the thresholds, the user is held if no
user is currently being held.
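
The threshold adjustment is a linear interpolation, which can be sketched as below. The function names are hypothetical, and it is an assumption of this sketch that only the components the rejected user failed are lowered.

```python
def lower_threshold(threshold: float, user_value: float,
                    weight: float = 0.5) -> float:
    """Move a threshold toward the rejected user's value.
    weight = 0.5 places it halfway between the two."""
    return threshold - weight * (threshold - user_value)


def relax_failed_thresholds(thresholds, user_aptitude, weight=0.5):
    """Lower only the thresholds the user failed to meet (an assumption);
    thresholds the user already met are left unchanged."""
    return [lower_threshold(t, u, weight) if t > u else t
            for t, u in zip(thresholds, user_aptitude)]


# The user fails the first requirement (1.0 < 3.0) and meets the second.
print(relax_failed_thresholds([3.0, 2.0], [1.0, 4.0]))  # prints [2.0, 2.0]
```

Repeated rejections of similar users move the threshold geometrically closer to their level without ever dropping below it.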

Once an employer has held an applicant, (s)he will wait a fixed
number of steps before hiring the applicant. In our simulation, the
wait lasts ten steps. If a tested user both has a lower discrepancy
than the held candidate and is accepted based on the thresholds,
then (s)he will be held and the previously held user will be
rejected. However, if a tested user has a lower discrepancy than the
held candidate but is rejected because (s)he is unable to meet the
threshold in some component, the employer will lower the threshold,
simulating the employer moving to accommodate users that are
generally better for the position, but are not currently being
accommodated. Once a user is hired, both the user and the employer
will be removed from the system and will not be considered by the
matching algorithm.

Figure 7: (a) The green (dotted) circle encloses the closest
users in terms of interest to Alice according to her original
interest vector, while the blue (solid) circle encloses the closest
employers. Alice’s interest vector is at the position of the
new vector after moving away from David, her top-ranked
match. Alice’s interest vector is closer to George than Carol,
changing who is considered in the next match. However, Eve
remains closer to Alice than to Henry. (b) The filtered results
for the second match. Note that the interests sorting has
changed from the first match, but not the aptitude sorting.
Since David was already found as a match, he is no longer
being considered, changing who is considered in both the n
closest by aptitude and the m closest by interest.
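
The employer's hold-or-reject rules from Section 4.1 can be condensed into one decision function. This is a sketch under stated assumptions: the name `review_applicant`, the return shape, and the rule that a passing applicant who is no better than the held one is simply rejected are all illustrative, since the text does not spell out that last case.

```python
def review_applicant(held, applicant, thresholds, discrepancy):
    """Decide what an employer does with the next applicant in its queue.

    held / applicant: aptitude vectors (held is None if nobody is held).
    discrepancy: function mapping an aptitude vector to the employer's
                 discrepancy for that user (lower is better).
    Returns (held_after, rejected, lower_thresholds).
    """
    passes = all(a >= t for a, t in zip(applicant, thresholds))
    better = held is None or discrepancy(applicant) < discrepancy(held)
    if passes and better:
        # Hold the applicant; any previously held user is rejected.
        return applicant, held, False
    # The applicant is rejected. Thresholds are lowered only when the
    # applicant failed them but would otherwise have been preferred.
    return held, applicant, (not passes) and better


requirements = [3.0, 3.0]


def gap(aptitude):
    """Equation 1 against this employer's requirements."""
    return sum(max(r - a, 0.0) ** 2 for a, r in zip(aptitude, requirements))


print(review_applicant(None, [3.0, 3.5], [2.5, 2.5], gap))
```

The third element of the result tells the caller whether to apply the threshold adjustment described earlier.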


4.2 The Simulation Results 


The results obtained from our simulation runs are shown in Table 3.
Following the convention common in psychology, we use a
significance level of 0.05, meaning that if the probability of the
z-score under the null hypothesis is less than 0.05, then the
results are statistically significant and the null hypothesis should
be rejected. In this case, all the probabilities are well below
0.05, and thus we can claim that our method is more effective than
comparable methods for the given parameters.
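
The test statistic is not spelled out in the text. A common choice, assumed here purely for illustration, is the two-sample z-statistic with unpooled variances, followed by a one-sided tail probability from the standard normal distribution. The function names are hypothetical and the sample figures are made up, not taken from Table 3.

```python
from math import sqrt
from statistics import NormalDist


def z_score(mean_a, var_a, mean_b, var_b, runs):
    """Two-sample z-statistic for the difference of two sample means,
    each estimated from `runs` independent simulation runs."""
    return (mean_a - mean_b) / sqrt(var_a / runs + var_b / runs)


def one_sided_p(z):
    """P(Z >= z) for a standard normal Z, computed as cdf(-z) so that
    small tail probabilities keep their precision."""
    return NormalDist().cdf(-z)


# Made-up figures: scheme A averages 10 hires, scheme B averages 8.
z = z_score(10.0, 4.0, 8.0, 5.0, 100)
print(round(z, 2), one_sided_p(z) < 0.05)
```

A z-score of 4.49, as reported for the Interest scheme, corresponds to a one-sided tail probability of a few millionths under this model.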


IDEAS 2021: the 25th anniversary 




Table 3: Results obtained from the simulation

| Scheme | Matched | Interest | Aptitude | Mixed |
| Average | _972 | 48.99 | 42.10 | 48.47 |
| Variance | 130.57 | 82.68 | 84.20 | 96.82 |
| Z-score against Matched under the null hypothesis | N/A | 4.49 | 5.96 | 4.46 |
| Probability of z-score under the null hypothesis | N/A | 3.62 × 10^-6 | 2.0 × 10^-9 | 4.22 × 10^-6 |


4.3 Observation 


While the simulation was only run for the given parameters, these
parameters were chosen arbitrarily within a simulation scheme
designed to emulate the behavior of real-world agents. Since we lack
real-world data to fill in the parameters, and it is not possible to
run the simulation on every possible combination of parameters, we
assume that the results from this set of parameters are as valid as
those from any other arbitrary set of parameters. This is enough to
establish the potential superiority our method may have if applied
in the real world.


5 CONCLUSIONS 


Anecdotal reports from companies with programs that bring in
workers with autism show that having such workers changes the
attitudes of their co-workers [17]. It opened the minds of these
co-workers not only to being accepting of a more diverse
population, but also to considering different problem-solving
strategies. The fact that employees with autism become financially
independent also decreases financial strain on relatives, and their
generally increased well-being reflects positively on everyone they
interact with. In this paper, we have proposed a job-matching
algorithm for adults with autism and potential employers that
connects adults with autism who have the necessary job skills with
the potential employers requiring them, and thereby contributes to
the autistic community and to companies that can benefit from hiring
a diverse workforce.

Since our job-matching algorithm is agnostic to whether or not
a user actually has a diagnosis of autism, it can also be used to
assist individuals who do not have a diagnosis but face similar
obstacles to finding employment. Other disorders for which potential
employees may have similar concerns include sensory processing
disorder, social (pragmatic) communication disorder, language
disorder, obsessive compulsive personality disorder, attention
deficit disorder, and social anxiety disorder. While our proposed
job-matching system is designed with autism in mind, it could be
expanded to include questions relating to even more disorders. As
the system is designed to be competitive, anyone could potentially
sign up for it, so it could serve as an alternative means of finding
employment for adults with other disabilities in general.


A THE APTITUDE AND INTEREST SCHEMA 


The Interest component consists of Social Culture, Repetition, Salary,
Location and AQ score, whereas the Aptitude contains Light Sensitivity,
Sound Sensitivity, Interview Skills, Loyalty, Trustworthiness,
Attention to Detail, Office Skills-Social/-Technical, Fine Motor Skills,
and Gross Motor Skills. Table 4 depicts some of the sample questions.

Table 4: Sample queries of the Aptitude and Interest schema

| Question | First Answer | Second Answer | Third Answer |
| How much do you like task consistency? | I can’t stand [illegible] tasks. | I like some sense of routine, but with some variety as well. | [...] |
| How consistent are the job tasks? | Every day is different at our company. | While the bulk of the work is similar daily, new problems are always arising that need to be solved. | Our work is ex- [...] |
| How are your interview skills? | I perform poorly during interviews. | I am not skilled at interviews, but do not consider the interview process to be a major obstacle. | I am highly skilled at demonstrating my soft skills during interviews. |
| How important is presentation during interview? | For us, interviews are merely a formality and we do not evaluate a candidate based on their interview. | Interviews are useful for assessing the fitness of an employee, but are strictly secondary to resumes, portfolios, and referrals when assessing a candidate. | We closely monitor the body language of candidates and other features of the presentation during interviews in order to assess soft skills. |
| What social culture do you desire in a workplace? | I prefer to work alone. | I am fine with working alone but enjoy the company of others. | I thrive off interacting with my coworkers. |
| What is the social culture in your job like? | People mostly keep to themselves. | Coworkers tend to maintain a casual relationship with each other. | Workers are expected to form intimate bonds with one another through their work. |
| How well can you function in a noisy environment? | I cannot function in a noisy environment. | Noise can distract or bother me, but I can manage. | I can effectively follow multiple conversations in a noisy environment. |
| How noisy is your workplace? | The workplace is quiet, and workers can wear headphones if they please. | There is frequently ambient conversation and workers are expected to remain alert, but there is nothing out of the ordinary. | The workplace is filled with loud noises at all times |

ABSTRACT


Despite the growing popularity of techniques related to graph 
summarization, a general operator for joining graphs on both the 
vertices and the edges is still missing. Current languages such as 
Cypher and SPARQL express binary joins through the non-scalable 
and inefficient composition of multiple traversal and graph creation 
operations. In this paper, we propose an efficient equi-join algorithm
that is able to perform vertex and path joins over a graph indexed in
secondary memory; the resulting graph is also serialised in secondary
memory. The results show that the implementation of the proposed
model outperforms solutions based on graphs, such as Neo4J and
Virtuoso, and on the relational model, such as PostgreSQL. Moreover,
we propose two ways in which edges can be combined, namely the
conjunctive and the disjunctive semantics. Preliminary experiments on
the graph conjunctive join are also carried out with incremental
updates, suggesting that our solution outperforms materialized
views over PostgreSQL.
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connectivity problems; « Information systems — Graph-based 
database models; Query languages; 
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1 INTRODUCTION 


The growing availability of data, mainly as knowledge bases for
scientific purposes [24], allows the combination of several different
types of graphs. Transport networks [17], social network relationships
[29], protein-to-protein association networks [26], and citation
networks [27] are all examples of graphs. In the relational model, the
join operator is the basic operation for combining different tables.
We would expect to use a similar operator for combining graph
data. As evidenced by the mathematical literature [13, 14], the desired
graph products (see §6) must combine both vertices and edges of
our (graph) operands and produce a graph as an output. Nonetheless,
the current literature has reached no consensus on what a graph join
is and, as illustrated in our running example, provides
two distinct classes of graph operators: vertex joins and path joins.
We now consider two distinct graph operands: a Researcher Graph
(Figure 1a) and a Citation Graph (Figure 1b).

The authors of [9] propose a vertex join with link creation
(Figure 1c): each social network user is matched to the papers where
they appear as first author. The black dashed edges in the figure
provide the output of the link creation. In vertex join as graph
fusion [15] (generalized in [4]), the two graphs should be distinct
connected components of the same graph, and the matching vertices
are now fused into one single vertex (Figure 1d). Path joins such
as SMJoin [10] are used for traversing one single graph operand
at a time [11, 20], where each traversed or matched path u ⇝ v
is replaced by one single edge u → v: this is possible because both
the vertex set and edge set are considered as relational tables, and
so the path traversal or match can be implemented as a multi-way
join producing one single table, thus providing the new edges. The
answer to “return the graph of papers where a paper cites another one
iff. the first author 1Author of the first paper follows the 1Author
of the second” requires a preliminary vertex join with vertex fusion
and then a path join. For this reason, Figure 1e extends the output
from the vertex fusion and creates additional links between the
vertices having both a Cites and a Follows outgoing edge. This
last approach is always possible, as we might force any graph query
language to generate a graph as an output (see §6).

At the time of writing, graph and relational query languages
are the only way to combine vertex joins with edge joins for
returning one single graph. Such languages, however, require the
composition of multiple clauses and operators (§6), thus resulting in
a non-scalable and relatively inefficient query plan for big data
scenarios (§7). Furthermore, the lack of an explicit graph join operator
in both kinds of query languages requires rewriting the graph join
query each time the underlying graph schema changes [7], thus
making graph querying cumbersome for real-world data integration
scenarios such as the (semi-)automated ones described in [21]. We
propose a novel graph equi-join algorithm over a secondary memory
graph representation that merges the vertex and path equi-joins
into one single step while traversing the graph. In compliance with
the mathematical literature, the vertex join combines two distinct
operand vertices into one and produces one single edge out of the
graph traversal of two distinct paths (i.e., outgoing edges).

We now provide some use cases requiring a data combination
from different sources for application and research purposes. For
instance, the graph equi-join algorithm can be used for investigating
multimodal demand in transportation network analysis [5].
We might be interested in joining together the networks of different
transportation services, where nodes and edges represent the
stops and paths, respectively, to plan a long journey or balance the
users’ demand. Edges from different operands might be combined in
several different ways that, in this paper, we denote as es semantics.


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada Giacomo Bergami


(a) Researcher Graph, Follower relations.
(b) Citation Graph, citation relations. Each paper has a first author.
(c) Vertex join with link discovery. The black dashed edges provide the output of the link creation.
(d) Vertex join with vertex fusion. The vertices that were previously linked are now fused into one single vertex.
(e) Path join over the vertices that have been previously fused via (d) with link creation (black dashed edges).
(f) Query with conjunctive semantics, es = ∧: Researcher ⋈^∧_{Name=1Author} Citation.
(g) Query with disjunctive semantics, es = ∨: Researcher ⋈^∨_{Name=1Author} Citation.

Figure 1: Example of a Graph Database containing two distinct graph operands, (a) and (b). Vertices are enumerated with circled
arabic numbers, while edges are enumerated with roman numbers in italics. Dotted edges mark edges shared between the two
different joins.

Temporal networks are an appealing application for graph joins
by (unique) node id: edges within a “snapshot” graph G_t represent
interactions at a given time t. Nodes might represent users within
social networks or computing agents, and interactions may
represent text messages or the spread of diseases. We may be
interested in either selecting all the interactions occurring at every
given timestamp or merging such interactions within one single graph.
The first scenario uses the conjunctive semantics, and the second
one the disjunctive semantics.

The disjunctive semantics could also be applied to the
bibliographic data scenario for answering the query “For each paper,
reveal both the direct and the indirect dependencies (either there is a
direct paper citation or one of the authors follows the other one in the
Researcher Graph)”. The resulting graph (Figure 1g) has the same
vertex set as Figure 1f, which exploits the conjunctive edge semantics,
but they differ in the final edges.
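
The difference between the two edge semantics can be sketched with a definitional, nested-loop join. This is only an illustration of the semantics described above, not the algorithm proposed in the paper: the function name graph_equi_join, the in-memory representation (a dictionary of vertex properties plus a set of edges) and the toy Follows/Cites edges are assumptions made for the example and are not read off Figure 1.

```python
from itertools import product

def graph_equi_join(ga, gb, theta, disjunctive=False):
    """Definitional (nested-loop) graph equi-join; hypothetical helper."""
    (va, ea), (vb, eb) = ga, gb
    # Vertex join: one fused vertex per theta-matching pair of operand vertices.
    vertices = {
        (a, b): {**va[a], **vb[b]}
        for a, b in product(va, vb)
        if theta(va[a], vb[b])
    }
    edges = set()
    for (u, u2), (v, v2) in product(vertices, repeat=2):
        left = (u, v) in ea      # edge present in the left operand
        right = (u2, v2) in eb   # edge present in the right operand
        keep = (left or right) if disjunctive else (left and right)
        if keep:
            edges.add(((u, u2), (v, v2)))
    return vertices, edges

# Toy operands (assumed data, loosely inspired by the running example).
researchers = (
    {"alice": {"Name": "Alice"}, "bob": {"Name": "Bob"},
     "carl": {"Name": "Carl"}, "dan": {"Name": "Dan"}},
    {("alice", "bob"), ("bob", "carl"), ("carl", "dan")},       # Follows
)
citations = (
    {"graphs": {"Title": "Graphs", "1Author": "Alice"},
     "join": {"Title": "Join", "1Author": "Alice"},
     "owl": {"Title": "OWL", "1Author": "Bob"},
     "projection": {"Title": "Projection", "1Author": "Carl"},
     "calc": {"Title": "Calc", "1Author": "Dan"}},
    {("graphs", "owl"), ("join", "graphs"), ("projection", "calc")},  # Cites
)

def theta(user, paper):
    return user["Name"] == paper["1Author"]

v_and, e_and = graph_equi_join(researchers, citations, theta)
v_or, e_or = graph_equi_join(researchers, citations, theta, disjunctive=True)
```

Both calls return the same five fused vertices; the conjunctive result keeps only the edges backed by both a Follows and a Cites edge, while the disjunctive result keeps every edge backed by at least one of them.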

With respect to our previous work [3], the main contributions are the
following: we upgraded the definition of the logical model and
simplified the definition of the join operators in [2], and we upgraded
the original GCEA algorithm to also support the disjunctive
operation. For this novel algorithm, we preserve the original computational
complexity optimality for the conjunctive case; we also discuss the
computational complexity of the disjunctive semantics [2]. We
expanded the experiments section by considering both bigger graphs
(up to 10° nodes) and more connected datasets (Kronecker Graph). 
Furthermore, we also showed how the basic procedures for joining 
graphs could be easily reused and minimally extended to support 
incremental graph join updates for the conjunctive semantics. 

For our graph join operator, we specifically designed two
secondary memory physical models, one for loading the graph operands
as an adjacency list via primary and secondary indices, and the
other one for returning the graph join result (§3). We implemented
a graph equi-join operator for both edge semantics (§4). We show
that bulk graphs can also be used to efficiently maintain positive
incremental updates for the graph conjunctive join (§5). After
outlining the formal computational complexity of both graph join
semantics as implemented in our algorithm (§6), our benchmarks
(§7) show that the combination of efficient algorithms and
an ad-hoc data structure outperforms the graph join definition over
current relational databases (PostgreSQL) in SQL and on graph
query languages for both property graphs and RDF data (Cypher
over Neo4J and SPARQL over Virtuoso). Similar conclusions can
be drawn for incremental updates.


2 LOGICAL DATA MODEL 


The term property graph usually refers to a directed, labelled and
attributed multigraph. With reference to Figure 1, we associate a
collection of labels to every vertex and edge (e.g., {User} or {Cites}).
Furthermore, vertices and edges may have arbitrary named attributes
(properties) in the form of key-value pairs (e.g., {Title=Graphs,
1Author=Alice}). Property-value associations of vertices and edges can
be represented by relational tuples; this is a common approach in the
literature even when graphs have no fixed schema. Relational tuples with
no conflicting key-value pairs can be merged with the ⊕ operator.
As RDF graphs can always be represented as Property Graphs,
we use the latter as the underlying logical model of choice for our
physical model and associated algorithm. Please refer to [2, §II] for
additional details.
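
A minimal sketch of the tuple merge follows, assuming that property tuples are modelled as Python dictionaries and that a conflict is signalled by returning None; the function name merge and this failure convention are illustrative choices, and the formal definition is the one in [2, §II].

```python
def merge(t1, t2):
    """Sketch of the tuple merge: union of key-value pairs, None on conflict."""
    for key in t1.keys() & t2.keys():
        if t1[key] != t2[key]:
            return None          # conflicting values: the tuples cannot be merged
    return {**t1, **t2}

user = {"Name": "Alice"}
paper = {"Title": "Graphs", "1Author": "Alice"}
fused = merge(user, paper)                          # disjoint keys: plain union
clash = merge({"Name": "Alice"}, {"Name": "Bob"})   # same key, different values
```

Here fused carries the properties of both operand vertices, while clash is None because the two tuples disagree on Name.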

Due to the lack of space, we moved the formal definitions of
graph joins, alongside the definition of both conjunctive and
disjunctive semantics, to [2, §III]. An operational definition is given in
§4, where we describe how to compute such an operator on top of the
physical data model. In the same technical report, we show that our
formal model of choice allows graph joins to be commutative and
associative operations, thus showing similar properties to relational
joins and enabling the definition of multi-graph equi-joins.


3 PHYSICAL DATA MODEL 


Near-Data Processing approaches for Big Data on SSDs [12, 22]
provide efficient resource utilization at runtime for both read (input)
and write (output) operations. To make read operations efficient,
we need to exploit the principle of locality; this requires the
usage of linear data structures, thus achieving sequential locality. As
a consequence, we represent each graph operand and its
associated primary and secondary indices as linear data structures.
Efficiency on write operations is obtained by representing a graph
adjacency list as fixed-size records; given that this data structure
contains limited information (i.e., vertices and edges represented
with their ids) that is later used for incremental
updates, we call this data structure “bulk graph”.


Read Operations: Indexed Graphs. Each graph operand is represented
using three data structures, whose definitions are given in
Figure 2b. VertexVals stores the adjacency list records associating
each vertex with its outgoing vertices; each record has a variable
size, and the records are sorted by vertex hash, thus inducing a
vertex bucketing. Such buckets may be accessed in constant time using
a primary index, HashOffset, of fixed-size records. After associating
a unique id to each vertex v_i ∈ V, such vertices can be randomly
accessed in constant time by using the secondary index VertexIndex.

VertexVals stores vertices alongside their adjacency lists,
where the outgoing vertices are sorted by hash value and are
represented by their unique id and hash value to avoid data replication.
Each vertex in VertexVals is stored by omitting the vertices’ attribute
names and serialising only the associated values (val[1] ... val[M]).
The graph equi-join algorithm explicitly relies upon the
vertex bucketing induced by the selection predicate θ, as well as
on the bucketing of each vertex’s outgoing edges by target vertex.
Therefore, we can achieve the advantages of a sorted hash join iff.
the to-be-matched vertices are sorted by hash value. If such
hash-sorted vertices are stored in a linear data structure as in
Figure 2b (e.g. an array, VertexVals), then we can create a hash index
HashOffset for all the vertices’ join buckets: such index is composed
of fixed-size records providing both the bucket value and the offset
to the first vertex of the bucket stored in VertexVals. Given that the
vertices in VertexVals have variable size, we define a secondary index
named VertexIndex of fixed-size records, allowing random access to the
vertices stored in VertexVals in O(1) time: the records are ordered by
vertex id, and each contains the vertex hash and the offset to where
the vertex data is stored in VertexVals. Figure 2c depicts a serialised
representation of the graph in Figure 2a: all the labels and the edge
values are not serialised but are still accessible via id. We access
our three data structures using memory mapping: the operating system
directly handles which pages need to be either cached or retrieved from
secondary memory, thus enabling opaque caching.


(b) Data structures used to implement the graph in secondary memory. Each data structure represents a different file.
(c) Using the graph schema in Figure 2b for representing G,; in secondary memory. v and v2 belong to a different bucket
from v; only for illustrative purposes.

Figure 2: Indexed Graph in secondary memory.


Write Operations: Bulk Graphs. The result of the graph join
algorithm could be too large to fit in main memory. For this
reason, some implementations like Virtuoso and PostgreSQL do
not store the result unless explicitly required, but evaluate
the query step by step. On the other hand, Neo4J sometimes
fails to provide the final result because it exhausts all the
available primary memory (Out of Primary Memory, OOM1). To
store the whole result as fixed-size blocks in secondary memory,
we chose to implement an ad-hoc graph data structure represented
as an adjacency list (Figure 3), where each entry represents a
result vertex (in blue, isVertex=true) and, possibly, its adjacent
edges (in orange, isVertex=false). Vertices and adjacent edges
are thus differentiated by the isVertex boolean flag, so that each block
(isVertex, Left, Right, 0, 0) represents either a vertex or an edge
having an id dt(Left, Right), where Left (Right) comes from the
left (right) operand and dt is a bijection mapping two ids to one
single id. The 0 slots are only used for disjunctive semantics (Figure 4):
when an edge coming from one operand is returned, a placeholder
+ stands for the edge missing from the other operand, and the 0 slots
are filled with additional vertex information for reaching the
destination node in the resulting graph. The information associated
with each bulk graph vertex and the associated outgoing edges can
be reconstructed while visiting the bulk graph by accessing both
operands’ VertexIndex. §5 will show that, given a set of (e.g., two)
bulk graphs, we can reconstruct an indexed graph.


| (true, ©, ©, 0, 0) | (false, i, v, 0, 0) | (true, ®, ©, 0, 0) | (false, iii, iv, 0, 0) | (true, 8, ®, 0, 0) | (true, @, @, 0, 0) | (true, ®, ®, 0, 0) |

Figure 3: Representing the bulk graph for the join Researcher ⋈^∧_{Name=1Author} Citation depicted in Figure 1f.


| (true, 0, ©, 9, 0) | (false, i, v, 0, 0) | (false, iii, +, ©, ®) | (true, ®, ©, 9, 0) | (false, iii, iv, 0, 0) | (false, i, t, ®, ©) |
| (true, ®, ©, 0, 0) | (false, f, viii, ®, ®) | (true, ®, ®, 0, 0) | (false, ii, +, , ®) | (false, f, vii, ®, @) | (true, ©, ©, 0, 0) | (false, iv, ¢, ®, ®) |

Figure 4: Representing the bulk graph for the join Researcher ⋈^∨_{Name=1Author} Citation depicted in Figure 1g.
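
The bulk graph layout can be sketched as a stream of fixed-size records. The sketch below is an assumption-laden illustration: the byte layout (one boolean flag plus four unsigned 64-bit integers), the helper names write_bulk and read_bulk, and the use of the Cantor pairing function as the bijection dt are choices made for the example; the text above only requires dt to be a bijection from two ids to one id.

```python
import io
import math
import struct

# Assumed layout: isVertex flag, Left, Right and two extra slots.
RECORD = struct.Struct("=?QQQQ")

def dt(left, right):
    """Cantor pairing, used here as one possible bijection from id pairs to ids."""
    s = left + right
    return s * (s + 1) // 2 + right

def dt_inverse(z):
    """Inverse of the Cantor pairing."""
    w = (math.isqrt(8 * z + 1) - 1) // 2
    right = z - w * (w + 1) // 2
    return w - right, right

def write_bulk(stream, adjacency):
    """Each vertex record is followed by the records of its outgoing edges."""
    for (left, right), edges in adjacency:
        stream.write(RECORD.pack(True, left, right, 0, 0))
        for e_left, e_right in edges:
            stream.write(RECORD.pack(False, e_left, e_right, 0, 0))

def read_bulk(stream):
    """Rebuild the adjacency list by scanning the records sequentially."""
    result = []
    while chunk := stream.read(RECORD.size):
        is_vertex, left, right, _, _ = RECORD.unpack(chunk)
        if is_vertex:
            result.append(((left, right), []))
        else:
            result[-1][1].append((left, right))
    return result

# Two result vertices; the first one has a single outgoing edge (toy ids).
adjacency = [((0, 4), [(1, 5)]), ((1, 6), [])]
buf = io.BytesIO()
write_bulk(buf, adjacency)
buf.seek(0)
restored = read_bulk(buf)
```

Because every record has the same size, the file can be appended to sequentially while the join runs and scanned back without any index.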


4 GRAPH EQUI-JOIN ALGORITHM 


Given that the most common use of join operations only includes
equality predicates among different tuples [8], we focus on an
algorithm for equi-join predicates θ. This paper extends our previous work
[3] by providing a novel join algorithm that supports both conjunctive
and disjunctive semantics.

The Graph Equi-Join Algorithm (Algorithm 1) consists of
four main parts: generating the hashing function from the vertex θ
predicate (Line 40); loading each graph operand G_x in primary
memory as a red-black tree map, associating each vertex id with a hash
value (bucket) alongside the associated outgoing vertices (Line 41);
creating the secondary memory indices described in the previous section
from the ordered map data structure (map_x, Line 42); and then
executing a partition sorted join over the adjacency list representation
of the serialised operands (Line 44).

First, we infer the hashing function h from θ: if θ(u, v) is a binary
predicate between distinct attributes from u and v, h is defined
as a linear combination of hash functions over the attributes of
either u or v. We implemented in C++ the Java hashing for lists as
$h(\vec{x}) = 3 \cdot 17^{|\vec{x}|} + \sum_{i=1}^{|\vec{x}|} 17^{|\vec{x}|-i} \cdot h(x_i)$,
where $h(x_i)$ is a hashing function for
each possible value $x_i$. If no function could be inferred from θ, h is
a constant function returning a default hashing value (e.g., 3).
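
A Java-style list hash with seed 3 and base 17 can be sketched as below. This is a hedged illustration rather than the C++ implementation: the Horner-style accumulation mirrors how Java hashes lists, but the exact exponents of the original formula, the 64-bit truncation and the per-value hash (crc32 is used here as a deterministic stand-in) are assumptions.

```python
import zlib

MASK = (1 << 64) - 1   # keep the result within 64 bits, as an unsigned C++ integer would

def element_hash(value):
    """Deterministic stand-in for the per-value hashing function."""
    return zlib.crc32(str(value).encode("utf-8"))

def list_hash(values, h=element_hash, seed=3, base=17):
    """Horner-style evaluation: acc = base * acc + h(x) for each value x."""
    acc = seed
    for value in values:
        acc = (base * acc + h(value)) & MASK
    return acc

empty = list_hash([])                          # no attributes: the default value
small = list_hash([1, 2], h=lambda x: x)       # (3 * 17 + 1) * 17 + 2
alice = list_hash(["Alice"])
```

With an empty attribute list the function returns the seed, which matches the default hashing value mentioned above; equal attribute lists always land in the same bucket.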

Loading performs a vertex bucketing for graph operand G_x in
main memory: its outcome is a vertex set stored as an ordered
multi-map map_x implemented over red-black trees, where each vertex v
is stored in a collection map_x[h(v)], where h is the aforementioned
hashing function. Before doing so, v’s adjacency list v.out is sorted
by target vertex hash value.

Indexing stores map_x in secondary memory as the read
graph in §3: as the vertices are sorted by hash value via the
ordered multi-map, the serialization is sequential and straightforward.
Please observe that the outgoing edges are sorted by the
target node’s hash.
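
The Loading and Indexing steps can be sketched as follows. The record layout (vertex id, hash, out-degree, then one id/hash pair per outgoing vertex), the omission of the attribute values, the use of a sorted dictionary in place of a red-black tree, and the toy hash (name length) are all assumptions made to keep the example short; vertex ids are assumed to be 0..n-1.

```python
import struct

HEADER = struct.Struct("=QQI")   # assumed header: vertex id, vertex hash, out-degree
TARGET = struct.Struct("=QQ")    # assumed adjacency entry: target id, target hash

def loading(vertices, edges, h):
    """Bucket vertices by hash; each adjacency list is sorted by target hash."""
    buckets = {}
    for vid, attrs in vertices.items():
        out = sorted((h(vertices[d]), d) for s, d in edges if s == vid)
        buckets.setdefault(h(attrs), []).append((vid, out))
    return dict(sorted(buckets.items()))   # ordered by hash value

def indexing(buckets, n_vertices):
    """Serialise the buckets sequentially and build the two indices."""
    vertex_vals = bytearray()
    hash_offset = []                       # (bucket hash, offset of its first vertex)
    vertex_index = [None] * n_vertices     # vertex id -> (hash, offset in vertex_vals)
    for hv, bucket in buckets.items():
        hash_offset.append((hv, len(vertex_vals)))
        for vid, out in bucket:
            vertex_index[vid] = (hv, len(vertex_vals))
            vertex_vals += HEADER.pack(vid, hv, len(out))
            for target_hash, target in out:
                vertex_vals += TARGET.pack(target, target_hash)
    return bytes(vertex_vals), hash_offset, vertex_index

def read_vertex(vertex_vals, offset):
    """Random access to one variable-size record, given its offset."""
    vid, hv, n = HEADER.unpack_from(vertex_vals, offset)
    start = offset + HEADER.size
    out = [TARGET.unpack_from(vertex_vals, start + TARGET.size * i) for i in range(n)]
    return vid, hv, out

vertices = {0: ("Alice",), 1: ("Bob",), 2: ("Alice",)}
edges = {(0, 1), (0, 2), (1, 0)}

def toy_hash(attrs):
    return len(attrs[0])   # toy hash: "Alice" -> 5, "Bob" -> 3

buckets = loading(vertices, edges, toy_hash)
vals, hash_offset, vertex_index = indexing(buckets, len(vertices))
```

HashOffset gives the start of each bucket for the sequential scan, while VertexIndex allows jumping directly to any vertex by id, e.g. read_vertex(vals, vertex_index[0][1]).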

The last step performs the actual join over the serialised
graphs (Join); let us first discuss the conjunctive join
semantics (es = ∧). Both operands’ data structures are accessed from
secondary memory through memory mapping. Line 43 provides
the buckets’ intersection via HashOffset: while performing a linear
scan over the sorted buckets in such primary index, the iteration
checks whether both operands have a bucket with the same hash value k.
When this happens, the two buckets can be accessed
jointly (Line 15). If there are two distinct vertices v and v′, one
for each operand, satisfying θ, a newly joined vertex is created by
merging those as v ⊕ v′ (Line 17). Next, unlike the relational join, we
also visit the adjacent vertices for both operands. Similarly to Line
43, the hash-sorted edges e and e′ having v and v′ as source nodes
induce a bucketing (Lines 23 and 24), and for each hash-matching
edge bucket, we check whether the destination vertices (e.dst, e′.dst)
meet the join condition, in which case the two edges are joined (Line 31).
Please note that edges are not filtered by the θ predicate, but only
combined according to the specific semantics. The design of the previously
described sorted bucket-based scan also reduces
the number of page faults, since all the vertices with the same hash
value are always stored in a contiguous block in VertexVals. The
resulting graph is stored in a bulk graph (Figure 3) where only
the vertex ids from the two graph operands appear as pairs. We
can show that the algorithm performs a Cartesian product over
both vertex sets and both edge sets in the worst-case scenario and
runs in quasi-linear time when a “good” hashing function is
found:


LEMMA 4.1. Given two graph operands $G_a$ and $G_b$ and a binary
operator $\theta$, the conjunctive algorithm runs in time
$T_\wedge(G_a, G_b) \in O(|V(a)|\,|V(b)|\,|E(a)|\,|E(b)|)$
in the worst-case scenario, and in time

$T_\wedge(G_a, G_b) \in O(|E(a) \cup E(b)| + |V(a)| \log |V(a)| + |V(b)| \log |V(b)|)$

in the best case scenario (proof at [2, §IV]).
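
The bucket-based conjunctive join can be sketched in memory as follows. This is a simplified illustration under stated assumptions: plain dictionaries stand in for the memory-mapped files, crc32 stands in for the hash inferred from θ, the names prepare and bucket_join are hypothetical, and result edges are reported as pairs of destination vertex ids rather than as pairs of edge ids.

```python
import zlib

def prepare(attrs, edges, key):
    """Bucket vertices, and each vertex's outgoing targets, by hash of `key`."""
    def h(v):
        return zlib.crc32(attrs[v][key].encode("utf-8"))
    buckets = {}
    out = {v: {} for v in attrs}
    for v in attrs:
        buckets.setdefault(h(v), []).append(v)
    for s, d in edges:
        out[s].setdefault(h(d), []).append(d)
    return attrs, buckets, out

def bucket_join(a, b, theta):
    """Conjunctive join: only hash-matching buckets are ever compared."""
    attrs_a, buckets_a, out_a = a
    attrs_b, buckets_b, out_b = b
    result = []   # [((v, v'), [(dst, dst'), ...]), ...]
    for k in sorted(buckets_a.keys() & buckets_b.keys()):        # bucket intersection
        for v in buckets_a[k]:
            for w in buckets_b[k]:
                if not theta(attrs_a[v], attrs_b[w]):
                    continue
                joined_edges = []
                # Same idea one level down: intersect the edge buckets.
                for k2 in sorted(out_a[v].keys() & out_b[w].keys()):
                    for d in out_a[v][k2]:
                        for e in out_b[w][k2]:
                            if theta(attrs_a[d], attrs_b[e]):
                                joined_edges.append((d, e))
                result.append(((v, w), joined_edges))
    return result

users = {0: {"Name": "Alice"}, 1: {"Name": "Bob"}}
papers = {0: {"1Author": "Alice"}, 1: {"1Author": "Bob"}, 2: {"1Author": "Alice"}}

def theta(user, paper):
    return user["Name"] == paper["1Author"]

joined = bucket_join(
    prepare(users, {(0, 1)}, "Name"),               # Alice follows Bob
    prepare(papers, {(0, 1), (2, 0)}, "1Author"),   # paper 0 cites 1, paper 2 cites 0
    theta,
)
```

Vertices whose hash does not occur in the other operand are never touched, which is where the saving over a Cartesian product comes from when the hash function separates the join keys well.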


Algorithm 1 Graph Equi-Join Algorithm

Input: Two graph operands G_a, G_b, an equi-join predicate θ, and the es
semantics
Output: G_a ⋈^es_θ G_b
Global: h, BI, map′_a, map′_b, θ

 1: function Disjunction(u, u′, E_L, E_R, k′, BulkGraph)
 2:   if k′ ∈ BI then
 3:     for e ∈ E_L and v′ ∈ map′_b[k′] do
 4:       if θ(e.dst, v′) then
 5:         ε := new Edge(u′, v′)
 6:         BulkGraph.E ← e ⊕ ε
 7:     for v ∈ map′_a[k′] and e ∈ E_R do
 8:       if θ(v, e.dst) then
 9:         ε := new Edge(u, v)
10:         BulkGraph.E ← ε ⊕ e
11:
12: function Join(map′_a, map′_b, isDisj)
13:   BulkGraph = open_write_file
14:   for each bucket k ∈ BI do
15:     for (v, v′) ∈ map′_a[k] × map′_b[k] do
16:       if θ(v, v′) then
17:         u := v ⊕ v′; BulkGraph.V ← u
18:         U := ∅
19:         I := Keys(v.map) ∩ Keys(v′.map)
20:         if isDisj then
21:           U := Keys(v.map) ∪ Keys(v′.map)
22:         else
23:           U := I
24:         for k′ ∈ U do
25:           if isDisj then
26:             E_L := EdgeSet(v.map[k′])
27:             E_R := EdgeSet(v′.map[k′])
28:           if ¬isDisj ∨ k′ ∈ I then
29:             for (e, e′) ∈ v.map[k′] × v′.map[k′] do
30:               if θ(e.dst, e′.dst) then
31:                 BulkGraph.E ← e ⊕ e′
32:                 if isDisj then E_L := E_L \ {e}; E_R := E_R \ {e′}
33:             if isDisj then Disjunction(v, v′, E_L, E_R, k′, BulkGraph)
34:           else
35:             Disjunction(v, v′, E_L, E_R, k′, BulkGraph)
36:   return BulkGraph
37:
38: function GEJA(G_a, G_b, θ, es)
39:   Setting θ as a global parameter
40:   h ← H(θ, A, A′)
41:   map_a ← Loading(h, G_a); map_b ← Loading(h, G_b)
42:   map′_a ← Indexing(map_a); map′_b ← Indexing(map_b)
43:   BI := Keys(map′_a) ∩ Keys(map′_b)
44:   return Join(map′_a, map′_b, es = ∨)


Algorithm 2 Graph Conjunctive Equi-Join With Incremental Update

Input: Operands (G_R ∪ G_R⁺), G_S and an equi-join predicate θ
Output: (G_R ∪ G_R⁺) ⋈^∧_θ G_S
Global: BI, map′_a, map′_b, θ

 1: function ConjunctiveIncrUpd(G_R ∪ G_R⁺, G_S, θ)
 2:   h ← H(θ, A, A′)
 3:   ▷ GEJA: creating the materialized view and global initialization
 4:   M ← GEJA(G_R, G_S, θ, ∧)
 5:   Load ← Loading(h, G_R⁺); Idx ← Indexing(Load)
 6:   BI′ := Keys(Load) ∩ Keys(map′_b)
 7:   Join := Join(Idx, map′_b, BI′, ∧)
 8:   return Loading⁺(x ↦ 0, M ∪ Join)


When es = ∨, the algorithm runs with the disjunctive semantics:
all the edges discarded from the intersection for v ⊕ v′ within the
conjunctive semantics at Line 30 should now be considered (Line 33),
whether they come from the left operand (Line 3) or from the right
one (Line 7). The E_L and E_R sets contain the aforementioned discarded
edges, which are then considered by the Disjunction
function. Among all such edges, we can run our disjunctive join
first over the ones coming from the left operand (Line 3): since the
final edges must only connect vertices belonging to the final vertex
set, we consider only those e that have a destination vertex e.dst
whose hash value appears in HashIndex (Line 2). Moreover, e.dst has to
satisfy the binary predicate θ jointly with another vertex v′ coming
from the opposite operand (Line 4). Hence, we establish an edge
e ⊕ ε having the same values and attributes of e and the same set of
labels; then, such edge is written in the bulk graph (Lines 5 and 6).
Similar considerations apply to the edges from the right
operand (Lines 7-10).

As we can see from Disjunction, the disjunctive semantics
potentially increases the final number of edges while keeping
the vertex set size unchanged, as requested (proof at [2, §IV]):


LEMMA 4.2. With respect to the time complexity, the best (worst)
case scenario of the disjunctive semantics is asymptotically
equivalent to the conjunctive semantics in its best (worst) case scenario, i.e.

$\limsup_{(|G_a|, |G_b|) \to (+\infty, +\infty)} \frac{T_\vee(|G_a|, |G_b|)}{T_\wedge(|G_a|, |G_b|)} = 1$,

under the same algorithmic conditions, where $|G_x|$ is a shorthand
for $|V(x)| + |E(x)|$.


5 INCREMENTAL UPDATES 


The current literature defines graph incremental queries only over graph
traversal queries returning a subgraph of the original graph operand,
which do not involve the creation of novel graphs as an output [25].
Similarly, graph query caching systems such as [31] mainly involve
graph pattern matching queries where no graph transformations are
involved. We therefore propose a positive incremental update algorithm
for graph conjunctive joins.

To motivate the adoption of such a general approach in all
current graph databases, we want to show that graph join
operations can be implemented alongside materialised views. This
requirement is crucial for Big Data, as it guarantees that, when graphs
are updated within the database, we do not need to recompute the
join operation from scratch. To meet this goal, we implement the
incremental updates of graph equi-joins similarly to how materialized
views are updated in relational databases (RDBMS). Let us briefly
recall how refreshing a materialized view over a binary join works:
suppose we have two relations, R and S, where tuples are
added (R⁺, S⁺) or removed (R⁻, S⁻). We can now express the final
updated relations as R ∪ R⁺ \ R⁻ and S ∪ S⁺ \ S⁻, respectively.
Consequently, if we want to update the materialised view M = R ⋈_θ S,
then intuitively we need to (i) remove the join between the deleted
tuples from the (previously computed) materialised view, (ii) join
the tuples added in the second operand with the left operand minus
the tuples removed from it, (iii) and vice versa, (iv) join only the
newly added tuples together, and finally (v) update M as the union
of all the previous computations. Formally, the refreshed view is

$(M \setminus (R^- \bowtie_\theta S^-)) \cup ((R \setminus R^-) \bowtie_\theta S^+) \cup (R^+ \bowtie_\theta (S \setminus S^-)) \cup (R^+ \bowtie_\theta S^+)$.

This equivalence has been known at least
since the late nineties [30]. In this paper, we investigate
positive incremental updates, i.e. when data is added and not
removed from the operands (R⁻ = S⁻ = ∅). If we also narrow down
the problem to updating just one operand (e.g., the left one: R⁺ ≠ ∅
and S⁺ = ∅), then the whole problem boils down to computing
M ∪ (R⁺ ⋈_θ S). This requires extending the materialized view M
by θ-joining the left increment with the whole right operand. As a
consequence, instead of re-indexing the whole left operand, that is
R ∪ R⁺, we can index only R⁺ before performing the theta join of
such an increment against S. We can formally prove that similar results
can be directly mimicked by the graph conjunctive join, where
the relations in the former equation are directly replaced by graph
operands.
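
For the relational case, the positive one-sided refresh can be checked with a few lines of code. The sketch below models relations as Python sets and uses hypothetical helper names (theta_join, refresh_positive); it only illustrates that joining the increment with the right operand and adding the result to M gives the same view as recomputing the join from scratch.

```python
def theta_join(r, s, theta):
    """Naive theta join between two relations given as sets of tuples."""
    return {(x, y) for x in r for y in s if theta(x, y)}

def refresh_positive(m, r_plus, s, theta):
    """Positive update of the left operand only: M ∪ (R⁺ ⋈ S)."""
    return m | theta_join(r_plus, s, theta)

def same_author(user, paper):
    return user[0] == paper[1]

R = {("Alice",), ("Bob",)}
S = {("Graphs", "Alice"), ("OWL", "Bob"), ("Projection", "Carl")}
R_plus = {("Carl",)}

M = theta_join(R, S, same_author)                       # materialised view
refreshed = refresh_positive(M, R_plus, S, same_author)
recomputed = theta_join(R | R_plus, S, same_author)     # full recomputation
```

Only the single tuple in R_plus is joined during the refresh, whereas the recomputation scans the whole left operand again.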

The literature motivates this restriction for temporal
social network scenarios, where we can assume, as other
authors did before us [16, 34], that novel information for Big Data
is always added and never removed. Within this paper, we
only consider positive incremental updates for the
conjunctive semantics (Algorithm 2). Although this use case might
seem too narrow, we remind the reader that, as shown in
the introduction, the conjunctive semantics already appears as
an essential operation for many relevant use cases; it also
shows that we can reassemble all the primary graph
routines from Algorithm 1 for expressing a novel operation, the
Graph Conjunctive Equi-Join with Incremental Updates. Even in
this case, our goal is to show that such a novel operation
outperforms refresh operations over relational databases,
given that current graph databases do not provide query update
operations as such (§7).

After creating the bulk graph for the intermediate representation
$M := G_R \bowtie_\theta G_S$ (Line 4), we want to update this
result by considering the increment $G_R^+$. As in relational
databases, we load just $G_R^+$ and index it (Line 5), and then
run a slightly different version of the Join algorithm (Line 7):
we must consider that the first vertex in $G_R^+$ does not
necessarily start from index 0, and that not all the vertices are
represented. Then, we need to take the two bulk graphs and create
a materialized view out of them, which is an indexed graph. Graph
unions can be implemented as the vertex and edge set unions [15]:
given that the bulk graphs are just adjacency lists, we need to
scan the two files and merge the two adjacency lists in primary
memory (MUJoin). We save the final result as an Indexed Graph via
a slight edit of the Loading phase (Loading), where we both
combine the bulk graph vertices $(u_j, v_i)$ as one single vertex
$u_j \oplus v_i$ and create an indexed graph not from secondary
memory data, but from the primary-memory loaded bulk graph (Line 8).
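
The algebraic rule behind this update can be illustrated with a minimal Python sketch. It is not Algorithm 2: it uses a naive nested-loop join over in-memory dictionaries, and the names `conjunctive_join`, `graph_union` and the toy attributes `org1`/`org2` are assumptions made for illustration only. It checks that, when the increment is itself a well-formed graph (every edge endpoint is listed among its vertices), joining only the increment and taking the union with the previous result gives the same graph as recomputing the join from scratch.

```python
def conjunctive_join(left, right, theta):
    """Naive conjunctive theta-join over graphs given as (vertices, edges).

    vertices: dict mapping a vertex id to its attribute dict
    edges:    set of (source id, target id) pairs
    """
    lv, le = left
    rv, r_edges = right
    # Pair up the vertices that satisfy theta and merge their attributes.
    vertices = {(u, v): {**lv[u], **rv[v]}
                for u in lv for v in rv if theta(lv[u], rv[v])}
    # Keep an edge only if both operands provide it between joined vertices.
    edges = {((u, v), (u2, v2))
             for (u, u2) in le for (v, v2) in r_edges
             if (u, v) in vertices and (u2, v2) in vertices}
    return vertices, edges


def graph_union(a, b):
    """Graph union as the union of the vertex and of the edge sets."""
    return {**a[0], **b[0]}, a[1] | b[1]


def theta(l, r):
    return l["org1"] == r["org2"]


# Toy operands (hypothetical data).
R = ({1: {"org1": "X"}, 2: {"org1": "Y"}}, {(1, 2)})
R_plus = ({2: {"org1": "Y"}, 3: {"org1": "X"}}, {(2, 3)})  # positive increment
S = ({"a": {"org2": "X"}, "b": {"org2": "Y"}, "c": {"org2": "X"}},
     {("a", "b"), ("b", "c")})

M = conjunctive_join(R, S, theta)                   # computed once
delta = conjunctive_join(R_plus, S, theta)          # join of the increment only
incremental = graph_union(M, delta)                 # merged result
full = conjunctive_join(graph_union(R, R_plus), S, theta)  # recomputation
print(incremental == full)
```

In this toy setting the increment repeats vertex 2 because one of its edges starts from it; without that vertex the increment would not be a self-contained graph and the identity would not apply as stated.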


6 RELATED WORK 


Discrete Mathematics. Every graph product of two graphs [13]
produces a graph whose vertex set is the cross product of the
operands’ vertex sets, thus creating a set of vertex pairs; the
edge set computation changes according to the graph product
definition. Please observe that vertex pairs differ from the
outcome of the relational algebra’s cartesian product, where the
two vertices are combined via $\oplus$. Nevertheless, such
operations do not take into account the naming of the resulting
table, which in graph databases could be assimilated to the
vertices’ (and edges’) labels. [33] provided a Kronecker Product
for edge uncertainty, even though no information is given on how
to combine the data associated with either vertices or edges: the
authors implemented a matrix operation within a relational
database. Conjunctive semantics implements Kronecker Products for
graph joins. Last, the graph conjunctive equi-join of two
node-labelled automata A and B provides a new automaton C, where
a vertex $u \oplus v$ is an initial (or accepting) state iff both
u and v are initial (or accepting) states in their respective
operands. By denoting $L[C]$ as the language generated by an
automaton C, we can show that $L[C] = L[A] \cap L[B]$ [1].
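
The automata remark can be checked on a toy case with the following Python sketch. The encoding is an assumption made for illustration: a node-labelled automaton is a tuple (labels, edges, initial, accepting), a word is accepted when some path from an initial to an accepting state spells it through the state labels, and the join predicate is label equality. The functions `accepts` and `product_automaton` are hypothetical helpers, not part of the paper.

```python
from itertools import product


def accepts(automaton, word):
    """True if some initial-to-accepting path spells `word` via state labels."""
    labels, edges, initial, accepting = automaton
    if not word:
        return False
    current = {s for s in initial if labels[s] == word[0]}
    for symbol in word[1:]:
        current = {t for (s, t) in edges
                   if s in current and labels[t] == symbol}
    return bool(current & accepting)


def product_automaton(a, b):
    """Conjunctive join of two automata on equal state labels."""
    la, ea, ia, fa = a
    lb, eb, ib, fb = b
    labels = {(u, v): la[u] for u in la for v in lb if la[u] == lb[v]}
    edges = {((u, v), (u2, v2))
             for (u, u2) in ea for (v, v2) in eb
             if (u, v) in labels and (u2, v2) in labels}
    initial = {(u, v) for (u, v) in labels if u in ia and v in ib}
    accepting = {(u, v) for (u, v) in labels if u in fa and v in fb}
    return labels, edges, initial, accepting


# A: starts and ends with 'a', never two consecutive 'b'.
A = ({0: "a", 1: "b"}, {(0, 1), (1, 0), (0, 0)}, {0}, {0})
# B: starts with 'a', never two consecutive 'a'.
B = ({"x": "a", "y": "b"}, {("x", "y"), ("y", "y"), ("y", "x")},
     {"x"}, {"x", "y"})
C = product_automaton(A, B)

words = ["".join(w) for n in range(1, 6) for w in product("ab", repeat=n)]
same_language = all(accepts(C, w) == (accepts(A, w) and accepts(B, w))
                    for w in words)
print(same_language)
```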


Graph Similarity. Kronecker graph products also have a more
practical application in the computation of graph kernels. Graph
kernels allow mapping graph data structures to feature spaces
(usually a Hilbert space in $\mathbb{R}^n$ for
$n \in \mathbb{N}_{>0}$) [23] so as to express graph similarity
functions that can then be adopted for both classification [28]
and clustering algorithms. Kronecker (graph) products were first
used to generate one single directed graph, over which the
summation of the weighted walks provides the desired graph kernel
result [23]. Kronecker graph products are also used in current
literature to generate the minimal general generalization of two
graphs. Such an intermediate graph is then used within a graph
edit-distance [6] that can be expressed as a kernel function
because it is a proper metric. Given that all such techniques
require a Kronecker graph product that can be expressed via a
conjunctive join, our conjunctive join algorithm proves
beneficial for efficiently computing those metrics over big graphs.

Schema alignment, data similarity and integration techniques 
relying on the similarity flooding algorithm [21] require the creation 
of a “pairwise connectivity graph” (PCG) from two different graphs 
(e.g., graph schemas), namely A and B. Such a graph is the outcome 
of a conjunctive graph join: 


$$((u \oplus v), \mathit{label}, (u' \oplus v')) \in E_{PCG} \Leftrightarrow (u, \mathit{label}, u') \in E_A \wedge (v, \mathit{label}, v') \in E_B$$


Given that the same algorithm can also be applied to a variety
of different scenarios such as (graph) schema alignment, matching
semistructured schemas with instance data, as well as finding
similarities among data instances via self graph joins [21], an
efficient implementation of the graph conjunctive algorithm will
also benefit these disparate scenarios.
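
As a small illustration of the pairwise connectivity graph definition above, the following Python sketch builds the PCG edge set from two labelled edge lists. It is a sketch under stated assumptions, not the similarity flooding implementation of [21]: edges are plain (source, label, target) triples, the joined vertex $u \oplus v$ is modelled as the tuple (u, v), and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def pairwise_connectivity_graph(edges_a, edges_b):
    """Return the PCG edges ((u, v), label, (u2, v2)) of two edge lists."""
    # Index the edges of B by label so each edge of A meets only candidates.
    by_label = defaultdict(list)
    for (v, label, v2) in edges_b:
        by_label[label].append((v, v2))

    pcg = set()
    for (u, label, u2) in edges_a:
        for (v, v2) in by_label.get(label, ()):
            pcg.add(((u, v), label, (u2, v2)))
    return pcg


edges_a = [("a1", "knows", "a2"), ("a1", "worksAt", "a3")]
edges_b = [("b1", "knows", "b2"), ("b2", "knows", "b3")]
pcg = pairwise_connectivity_graph(edges_a, edges_b)
for edge in sorted(pcg):
    print(edge)
```

Only the two `knows` edges of B pair up with the single `knows` edge of A; the `worksAt` edge has no counterpart and therefore contributes nothing.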

Please note that none of the aforementioned approaches pro- 
vided an efficient implementation of the required graph product, as 
the graph product was just a pre-processing step required to solve 
the main problem and not the target of the research question. 


Relational Databases. Walking in the footsteps of current
literature, we can represent each graph operand as two distinct
tables, one listing the vertices and the other listing the edges
[2, Appendix C]. In addition to the attributes within the
vertices’ and the edges’ tables, we assume that each row (on both
vertices and edges) has an attribute id enumerating vertices and
edges. Concerning the SQL interpretation of such a graph join, we
first join the vertices. The edges are computed through the join
query provided in [2, Appendix C]: the root and the leaves are
the results of the $\theta$ join between the vertices, while the
edges appear as the intermediate nodes. An adjacency list
representation of a graph, like the one proposed in the current
paper, reduces the joins within the relational solution to one
(each vertex and edge is traversed only once), thus reducing the
number of operations required to create the resulting graph. On
the other hand, graphs as secondary memory adjacency lists enable
efficient distributed algorithms for graph traversals [18].


Graph Query Languages. Different graph query languages subsume
different graph database management system architectures: while
SPARQL allows access to multiple graph resources via named
graphs, Cypher stores only one graph per database at a time, and
hence does not naturally support binary graph join operations.
Different graph operands might still be expressed via distinct
connected components that are potentially associated with
different labels. Graph query languages such as Cypher and SPARQL
do not provide a specific keyword to compute the graph join
between two graphs. As a consequence, the graph join must be
written as an explicit query composed of several different
clauses. Concerning Cypher, the CREATE clause has to be used to
generate new vertices and edges from graph patterns extracted
through the MATCH...WHERE clause, and intermediate results are
merged with UNION ALL. By analyzing the associated query plan, we
observed that the query language requires us to traverse the same
graph five times, thus visiting the same graph database and all
the associated sub-patterns multiple times. On the other hand,
SPARQL reduces the number of distinct traversed patterns to two,
but the need to invoke an OPTIONAL clause makes the overall query
plan inefficient because all the certain paths are left joined
with the optional ones. At this point, the CONSTRUCT clause is
required if we want to finally combine the traversed paths from
both graphs into a resulting graph. The construction of new
graphs is not part of the algebraic optimisation of SPARQL
queries, which were originally designed to return tables; this
introduces additional computational overhead. This limitation is
also shared with Cypher.


7 EXPERIMENTAL RESULTS 


Our experiments were performed on top of two datasets⁴: we used a
real social network dataset (Friendster [32]) for simulating a
sparse graph, and we used the SNAP Kronecker graph generator
krongen for simulating real-world transport and communication
networks with heavy-tailed degree distributions [19]. Given that
neither dataset comes with vertex value information, we enriched
them using the guidelines of the LDBC Social Network Benchmark
protocol. We associated each vertex (user-id) with a name,
surname, e-mail address, organization name, year of employment,
sex, and city of residence.

We performed our tests over a Lenovo ThinkPad P51 with a
3.00 GHz (up to 4.00 GHz) Intel Xeon processor and 64 GB of RAM
up to 4.000 MHz. The tests were performed over an SSD with an
ext4 (GNU/Linux) File System with a free disk space of 50GB. We
evaluate the graph join using as operands two distinct sampled
subgraphs with the same vertex size (|V|), where the $\theta$
predicate is the following one:
$\theta(u, v) \overset{def}{=} u.Year1 = v.Year2 \wedge u.Organization1 = v.Organization2$.
Such a predicate does not perform a perfect 1-to-1 match with the
graph vertices, thus allowing us to test the algorithm with
different multiplicity values. We performed the same join query
over the two different datasets for both semantics.
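
To make the multiplicity remark concrete, here is a minimal Python sketch of the predicate together with a hash-bucket equi-join over vertices. It is an illustration under stated assumptions, not the paper's indexed data structure: vertices are plain dictionaries, `bucket_equi_join` is a hypothetical helper, and the sample values are invented.

```python
from collections import defaultdict


def theta(u, v):
    """The equi-join predicate used in the experiments."""
    return (u["Year1"] == v["Year2"]
            and u["Organization1"] == v["Organization2"])


def bucket_equi_join(left, right):
    """Pair vertex ids whose (year, organization) keys coincide."""
    buckets = defaultdict(lambda: ([], []))
    for uid, u in left.items():
        buckets[(u["Year1"], u["Organization1"])][0].append(uid)
    for vid, v in right.items():
        buckets[(v["Year2"], v["Organization2"])][1].append(vid)
    # Only vertices falling in the same bucket can satisfy theta.
    return [(uid, vid)
            for lefts, rights in buckets.values()
            for uid in lefts for vid in rights]


left = {
    1: {"Year1": 2010, "Organization1": "ACME"},
    2: {"Year1": 2012, "Organization1": "Initech"},
    3: {"Year1": 2015, "Organization1": "ACME"},
}
right = {
    10: {"Year2": 2010, "Organization2": "ACME"},
    11: {"Year2": 2010, "Organization2": "ACME"},
    12: {"Year2": 2012, "Organization2": "Initech"},
    13: {"Year2": 2012, "Organization2": "ACME"},
}

matches = sorted(bucket_equi_join(left, right))
naive = sorted((u, v) for u in left for v in right
               if theta(left[u], right[v]))
print(matches)
```

Vertex 1 matches two right vertices, vertex 2 matches one, and vertex 3 matches none, so the match is not 1-to-1.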


³ Please refer to [2, Appendix B] for an in-depth explanation of the limitations of such
languages. In this technical report, we also provide the original benchmark queries for
reproducibility purposes.

⁴ https://osf.io/xney5/

We used the default configurations for Neo4J 3.5.0 and
PostgreSQL 10.6, while we changed the cache buffer configurations
for Virtuoso 7.20.3230 (as suggested in the configuration file)
for 64 GB of RAM; we also kept the default multi-threaded query
execution plan. PostgreSQL queries were evaluated through the
psql client and benchmarked using both the explain analyse and
\timing commands; the former allows analysing SQL’s query plans.
Virtuoso was benchmarked by using unixODBC connections; SPARQL’s
associated query plan was analysed via Virtuoso’s profile
statement. Cypher queries were sent using the Java API, but the
graph join operation was performed only in Cypher through the
execute method of a GraphDatabaseService object.
GRAPH EQUI-JOIN. For each distinct dataset, we generate two
graph collections, one for the left and one for the right
operand. Each collection was generated by random walk sampling of
the given dataset starting from the same vertex, and storing each
graph once the number of traversed vertices is a power of 10,
from 10 to $10^8$. The only parameter that was changed for
generating the distinct collections for the left and right
operands was the graph traversal seed. Overall, the sampling
approach guarantees that each left and right operand subgraph
always share at least one vertex and that each operand
approximates the degree distribution of the original dataset.
Furthermore, each left (right) operand of size $10^i$ is always a
subgraph of all the operands of size $10^{i+n}$ for $n \geq 1$
and $i + n \leq 8$. For Friendster, we started the random walk
from one of the hub nodes, while we randomly picked it for the
other dataset.
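
The nesting property of the sampled operands can be sketched as follows. This is a hedged illustration of the sampling idea rather than the benchmark code: `random_walk_snapshots`, the restart rule on dead ends, the step limit and the toy complete graph are all assumptions, and the snapshot is kept as a vertex set (the operand graph would be the subgraph induced by it).

```python
import random


def random_walk_snapshots(adj, start, seed, max_exp, max_steps=1_000_000):
    """Walk randomly from `start` and snapshot the visited vertex set each
    time its size reaches 10, 100, ..., 10**max_exp."""
    rng = random.Random(seed)
    visited = {start}
    snapshots = {}
    current = start
    exp = 1
    for _ in range(max_steps):
        if exp > max_exp:
            break
        neighbours = adj.get(current, [])
        # Assumption: restart from the start vertex on a dead end.
        current = rng.choice(neighbours) if neighbours else start
        visited.add(current)
        if len(visited) == 10 ** exp:
            snapshots[exp] = frozenset(visited)
            exp += 1
    return snapshots


# Toy dataset: a complete graph on 200 vertices.
adj = {i: [j for j in range(200) if j != i] for i in range(200)}

# Same start vertex, different seeds for the two operand collections.
left = random_walk_snapshots(adj, start=0, seed=1, max_exp=2)
right = random_walk_snapshots(adj, start=0, seed=2, max_exp=2)

print(len(left[1]), len(left[2]))
print(left[1] <= left[2])        # smaller operand nested in the larger one
print(0 in left[1] & right[1])   # operands share at least the start vertex
```

Because the visited set only grows, each snapshot is contained in every later one, and both collections contain the start vertex by construction.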

The experiments show that, for both graph equi-join semantics,
our approach outperforms their interpretation in current query
languages for both graph and relational databases. We consider
the time to (i) serialise our data structure (Loading) and (ii) eval- 
uate the query plan (Indexing and Joining). This twofold analysis 
is required because, in some cases, the costly creation of several 
indices may lead to better query performance. Given that our graph 
join result is represented as a bulk graph where no vertex and edge 
information is serialised, all the queries fed to our competitors are 
designed only to return the vertex and the edge id that are matched. 
The lack of ancillary data attached to either vertices or edges al- 
lows better comparison of query evaluation times, which are now 
independent of the values’ representations and tailored to evaluate 
both the access time required for the loaded operator and returning 
the joined graph. 

Table 1 provides a comparison of the actual loading, indexing
and joining running times for our conjunctive and disjunctive
graph join implementation. This evaluation confirms that the
actual conjunctive joining time is negligible, while the whole
computational burden is borne by both the operand loading and the
indexing. On the other hand, the disjunctive join task dominates
over the loading phase for bigger graphs. Moreover, denser graphs
are affected more by the indexing time than by the loading time.
The experimental evaluation also confirms that the disjunctive
semantics contains a superset of operations with respect to the
conjunctive one, and therefore its computation takes more time
than the conjunctive one. We also analysed the main memory
consumed by our algorithm using malloc_stats and the Ubuntu
System Monitor, and we observed that our algorithm takes at most
22GB of primary memory.

Figure 5: Loading time over the enriched Friendster and
Kronecker Datasets. 1H = $3.60 \cdot 10^3$ s

Figure 6: Graph Equi-Joins over the enriched Friendster
Dataset: (Indexing and) Join Time. We mark the benchmarks that
exceeded the one hour threshold with >1H, as well as the
algorithms failing to compute due to either primary (OOM1) or
secondary (OOM2) out of memory error.


Figure 5 [2, Table I] compares the loading time for both datasets
for our solution as well as for all the competitors. Please note
that the loading time in PostgreSQL corresponds to loading the
vertices’ and edges’ tables for both operands, in Virtuoso it
corresponds to the creation of two distinct named graphs, and in
Neo4J it corresponds to the creation of two distinct connected
components within one single graph database. Globally, graph
databases are less performant than relational databases:
concerning the loading phase, Virtuoso creates triple indices for
speeding up graph traversal operations, while in Neo4J an edge
creation always requires checking whether two vertices already
exist. PostgreSQL is not affected by this computational burden,
given that the graph representation is tabular. Given that our
proposed solution also requires linking vertices with edges,
PostgreSQL outperforms our solution for loading big operands,
while our implementation is faster for smaller datasets.

Table 1: Graph Conjunctive and Disjunctive Equi-Join: Separating Loading, Indexing and Joining time for the graph join
algorithm, thus comparing the theoretical computational complexity and the empirical evaluation. Please note that the Loading
and Indexing time are the same for both semantics.

Figure 7: Graph Equi-Joins over the enriched Kronecker
Graph Dataset: (Indexing and) Join Time.


Figures 6 [2, Table II] and 7 [2, Table III] compare the
conjunctive and disjunctive implementations to the competitors’
solutions for the Friendster and the Kronecker dataset,
respectively. For both PostgreSQL and
Neo4J, the (join) indexing happens during the query evaluation, 
and therefore we compare the sum of our indexing and joining 
time to the competitors’ query evaluation. On the other hand, our 
proposed indexing and join time always outperforms the competi- 
tors’ query plans by at least one order of magnitude, even when the 
join algorithm cost is considered as the sum of both loading and 
indexing+joining. Please also note that the Virtuoso query engine 
rewrites the SPARQL query into SQL and, as a result of this, two 
SQL queries were performed in both Virtuoso and PostgreSQL. 
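
For readers unfamiliar with the relational interpretation, the following Python sketch runs two SQL queries of this kind, one for the joined vertices and one for the joined edges, on an in-memory SQLite database. It is an assumption-laden illustration: the table names, columns and sample rows are hypothetical, SQLite stands in for PostgreSQL, and the queries are not the benchmark queries of [2, Appendix C].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE left_vertices  (id INTEGER PRIMARY KEY, year INTEGER, org TEXT);
CREATE TABLE left_edges     (src INTEGER, dst INTEGER);
CREATE TABLE right_vertices (id INTEGER PRIMARY KEY, year INTEGER, org TEXT);
CREATE TABLE right_edges    (src INTEGER, dst INTEGER);
INSERT INTO left_vertices  VALUES (1, 2010, 'ACME'), (2, 2012, 'Initech');
INSERT INTO left_edges     VALUES (1, 2);
INSERT INTO right_vertices VALUES (10, 2010, 'ACME'), (11, 2012, 'Initech'),
                                  (12, 2012, 'Initech');
INSERT INTO right_edges    VALUES (10, 11), (10, 12), (11, 12);
""")

# Query 1: theta-join of the vertex tables.
vertices = conn.execute("""
    SELECT l.id, r.id
    FROM left_vertices AS l
    JOIN right_vertices AS r ON l.year = r.year AND l.org = r.org
    ORDER BY 1, 2
""").fetchall()

# Query 2: an edge survives if both of its endpoints were joined by theta.
edges = conn.execute("""
    SELECT le.src, re.src, le.dst, re.dst
    FROM left_edges AS le
    CROSS JOIN right_edges AS re
    JOIN left_vertices  AS ls ON ls.id = le.src
    JOIN left_vertices  AS ld ON ld.id = le.dst
    JOIN right_vertices AS rs ON rs.id = re.src
                             AND rs.year = ls.year AND rs.org = ls.org
    JOIN right_vertices AS rd ON rd.id = re.dst
                             AND rd.year = ld.year AND rd.org = ld.org
    ORDER BY 1, 2, 3, 4
""").fetchall()

print(vertices)
print(edges)
```

The edge query touches the vertex tables four times, which gives an intuition for why the relational plan performs more comparisons than a single scan over an adjacency list.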

The disjunctive Equi-Join algorithm always runs faster than the
conjunctive Equi-Join of the best-performing competitor
(PostgreSQL), thus underlining the effectiveness of our
implementation independently of the semantics of choice.
Furthermore, all the competitors’ implementations of the
disjunctive semantics suffer from lower performance, as expected;
the reason is twofold: the for-all statement requires, for both
SQL and Cypher, an aggregation that has already been proved to be
inefficient [4], jointly with an increase of the patterns that need to
be visited and that the current languages’ query plan does not allow
also shows to be more memory efficient if compared with other
implementations: producing the bulk graph for the 10° operands 
from the Kronecker dataset takes more than 1 H after writing more 
than 50GB in secondary memory while producing the same bulk 
graph for 10’ operands required only 30GB of free disk space. On 
the other hand, PostgreSQL consumed all the remaining secondary 
memory already with the 10° operands. 

As a result, the inefficiency of the graph query languages is due
both to the need to repeatedly traverse the same data over
multiple patterns with no query plan optimisation and to the fact
that such query languages are not optimised to return one graph
as a result. On the other hand, the SQL interpretation of the
graph conjunctive equi-join reduces the number of comparisons,
thus proving more performant than the other relational databases.
Still, our proposed solution of representing a graph as an
indexed adjacency list reduces the overall number of comparisons
and hence reduces the computation time by at least one order of
magnitude over bigger datasets.


INCREMENTAL UPDATES. Within this paper, we consider positive
incremental updates for the graph conjunctive Equi-Join. We chose
the biggest operands of the Kronecker Graph Dataset where all the
competitors took less than one hour to compute ($10^4$ for the
graph conjunctive join) as the basic input for
$M = R \bowtie_\theta S$. Then, only for the left operand, we
extend the former series via random graph samples with additional
$10^4 + x \cdot 10^3$ operands for $x \in \{1, 2, 5, 7, 9\}$,
such that the difference of graph $10^4 + x \cdot 10^3$ with
graph $10^4$ becomes our $R^+$.

Given that Neo4J (and Virtuoso) does not support creating
materialized views over Cypher (and SPARQL) queries and
considering that PostgreSQL is the fastest competitor, we
restrict our analysis to PostgreSQL. After loading the main
operands and storing their graph join as two materialized views
(one for the resulting vertices and the other for the resulting
edges), we load only $R^+$ (Copy) and we run REFRESH MATERIALIZED
VIEW for both the vertex and the edge table representing the
graph conjunctive Equi-Join result. Similarly, for benchmarking
our operator we perform the following operations: after computing
$M = R \bowtie_\theta S$ with the usual graph conjunctive join
algorithm, we only load the $R^+$ graph (Load), then we run the
indexing and joining algorithm for $R^+ \bowtie_\theta S$
(Idx+Join), and finally, we load the two bulk graphs in primary
memory and then store the result using Loading (Mat). Table 2
provides the result of such evaluation: while PostgreSQL is still
more performant at loading the data, our implementation providing
indexed graphs from bulk graphs proves to be more efficient than
the competitor, thus validating the overall bulk graph approach
for yet another scenario. This outcome clearly shows that bulk
graphs can be efficiently reconstructed as indexed graphs with a
small overhead. Last, we showed that incremental updates can be
efficiently implemented without ad-hoc indexing data structures
by exploiting algebraic rules.

Table 2: Graph Conjunctive Join and Bulk Merging vs.
PostgreSQL’s Materialized Views

          |           Proposed            |        PostgreSQL
|R+|/|R|  | Load (s)   | Idx+Join+Mat (s) | Copy (s)   | Refresh (s)
10%       | 2.39·10^-? | 3.43·10^-?       | 1.50·10^-? | 3.26·10^-1
20%       | 4.73·10^-? | 6.38·10^-?       | 2.04·10^-2 | 3.58·10^-1
50%       | 1.16·10^-1 | 1.57·10^-2       | 2.42·10^-? | 4.53·10^-1
70%       | 1.77·10^-1 | 2.25·10^-2       | 4.89·10^-? | 4.60·10^-1
90%       | 1.82·10^-1 | 2.80·10^-2       | 5.82·10^-2 | 7.87·10^-1

?: exponent digit illegible in the source.


8 CONCLUSIONS 


We propose a graph equi-join algorithm for the conjunctive and
disjunctive semantics, outperforming the implementations on
different data representations and query languages. Future work
will provide extensive benchmarks for incremental updates when
tuples are either added or removed in both operands. A novel
algorithm for the disjunctive semantics is also going to be
provided for such incremental updates. The bucketing approach
would lead to a straightforward concurrent implementation of our
join algorithm, where each process might run the join over a
partition of the buckets. After partitioning the graph operands
using a heuristic of choice, we can also exploit the incremental
update algorithm for a distributed computation of graph joins
over Big Data graphs. Last, we also carried out experiments to
make the conjunctive semantics support the < predicate in
$\theta$ over one attribute having partially ordered values:
since our data structure is already ordered by hash value, we
could use a hashing function h that is monotone with respect to
such an attribute. This contribution, as well as the
implementation of left-, full-, and materialized- join
algorithms, will be addressed in our future work.


ABSTRACT


Using data warehouses to analyse multidimensional data is a sig- 
nificant task in company decision-making. The need for analyzing 
data stored in different data warehouses generates the requirement 
of merging them into one integrated data warehouse. The data 
warehouse merging process is composed of two steps: matching 
multidimensional components and then merging them. Current 
approaches do not take all the particularities of multidimensional 
data warehouses into account, e.g., only merging schemata, but 
not instances; or not exploiting hierarchies nor fact tables. Thus, 
in this paper, we propose an automatic merging approach for star 
schema-modeled data warehouses that works at both the schema 
and instance levels. We also provide algorithms for merging
hierarchies, dimensions and facts. Finally, we implement our
merging algorithms and validate them using both synthetic and
benchmark datasets.


1 INTRODUCTION


Data warehouses (DWs) are widely used in companies and
organizations as an important Business Intelligence (BI) tool to
help build decision support systems [9]. Data in DWs are usually
modeled in a multidimensional way, which allows users to consult
and analyze the aggregated data through multiple analysis axes
with On-Line Analysis Processing (OLAP) [14]. In a company,
various independent DWs containing some common elements and data
may be built for different geographical regions or functional
departments. There may also exist common elements and data
between the DWs of different companies. The ability to accurately
merge diverse DWs into one integrated DW is therefore considered
a major issue [8]. DW merging constitutes a promising solution to
provide more opportunities for analysing consistent data coming
from different sources.

A DW organizes data according to analysis subjects (facts) asso- 
ciated with analysis axes (dimensions). Each fact is composed of 
indicators (measures). Finally, each dimension may contain one or 
several analysis viewpoints (hierarchies). Hierarchies allow users 
to aggregate the attributes of a dimension at different levels to 
facilitate analysis. Hierarchies are identified by attributes called 
parameters. 

Merging two DWs is a complex task that implies solving several 
problems. The first issue is identifying the common basic compo- 
nents (attributes, measures) and defining semantic relationships 
between these components. The second issue is merging schemata 
that bear common components. Merging two multidimensional 
DWs is difficult because two dimensions can (1) be completely iden- 
tical in terms of schema, but not necessarily in terms of instances; 
(2) have common hierarchies or have sub-parts of hierarchies in 
common without necessarily sharing common instances. Likewise, 
two schemata can deal with the same fact or different facts, and 
even if they deal with the same facts, they may or may not have 
measures in common, without necessarily sharing common data. 

Moreover, a merged DW should respect the constraints of the 
input multidimensional elements, especially the hierarchical rela- 
tionships between attributes. When we merge two dimensions hav- 
ing matched attributes of two DWs, the final DW should preserve 
all the partial orders of the input hierarchies (i.e., the binary aggre- 
gation relationships between parameters) of the two dimensions. 
It is also necessary to integrate all the instances of the input DWs, 
which may cause the generation of empty values in the merged DW. 
Thus, the merging process should also include a proper analysis of 
empty values. 
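
The appearance of empty values can be illustrated with a small Python sketch. It is only an illustration of the problem, not the merging algorithm proposed in this paper: `merge_dimension_instances`, the attribute names and the sample rows are hypothetical, and conflicting values are resolved by simply keeping the last one seen, which is an assumption.

```python
def merge_dimension_instances(rows_a, rows_b, key):
    """Merge two lists of dimension instances (dicts) on a shared key.

    The merged schema is the union of the attributes; attributes unknown
    for an instance are left as None (empty values).
    """
    attributes = []
    for row in rows_a + rows_b:
        for name in row:
            if name not in attributes:
                attributes.append(name)

    merged = {}
    for row in rows_a + rows_b:
        # Create the instance with every attribute empty, then fill it in.
        target = merged.setdefault(row[key], dict.fromkeys(attributes))
        target.update(row)
    return list(merged.values())


# Customer dimension of two DWs with partly different attributes.
dw1 = [{"id": 1, "city": "Lyon", "country": "France"},
       {"id": 2, "city": "Toulouse", "country": "France"}]
dw2 = [{"id": 2, "city": "Toulouse", "region": "Occitanie"},
       {"id": 3, "city": "Montreal", "region": "Quebec"}]

merged = merge_dimension_instances(dw1, dw2, key="id")
for row in merged:
    print(row)
```

Instance 1 has no region and instance 3 has no country after the merge, which is the kind of empty value the merging process has to analyse.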




IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 


In sum, the DW merging process concerns matching and merging 
tasks. The matching task consists in generating correspondences 
between similar schema elements (dimension attributes and fact 
measures) [4] to link two DWs. The merging task is more complex 
and must be carried out at two levels: the schema level and the 
instance level. Schema merging is the process of integrating several 
schemata into a common, unified schema [12]. Thus, DW schema 
merging aims at generating a merged unified multidimensional 
schema. The instance level merging deals with the integration and 
management of the instances. In the remainder of this paper, the 
term “matching” designates schema matching without considering 
instances, while the term “merging” refers to the complete merging 
of schemata and corresponding instances. 

To address these issues, we define an automatic approach to 
merge two DWs modeled as star schemata (i.e., schemata contain- 
ing only one fact table), which (1) generates an integrated DW 
conforming to the multidimensional structures of the input DWs, 
(2) integrates the input DW instances into the integrated DW and 
copes with empty values generated during the merging process. 

The remainder of this paper is organized as follows. In Section
2, we review the related work about matching and merging DWs. In
Section 3, we introduce the basic concepts of multidimensional DW
design. In Section 4, we specify an automatic approach to merge
different DWs and provide DW merging algorithms at the schema and
instance levels. In Section 5, we experimentally validate our
approach. Finally, in Section 6, we conclude this paper and
discuss future research.


2 RELATED WORK 


DW merging actually concerns the matching and the merging of 
multidimensional elements. We classify the existing approaches 
into four levels: matching multidimensional components, matching 
multidimensional schemata, merging multidimensional schemata 
and merging DWs. 

A multidimensional component matching approach for matching 
aggregation levels is based on the fact that the cardinality ratio of 
two aggregation levels from the same hierarchy is nearly always the 
same, no matter the dimension they belong to [3]. Thus, by creating 
and manipulating the cardinality matrix for different dimensions, 
it is possible to discover the matched attributes. 

The matching of multidimensional schemata aims at discovering
the matches between all the multidimensional components of two
multidimensional schemata. A process to automatically match two
multidimensional schemata is achieved by evaluating the semantic
similarity of multidimensional component names [2]. Attribute and
measure data types are also compared in this way. The selection
metric of bipartite graph helps determine the mapping choice and
define rules aiming at preserving the partial orders of the
hierarchies at mapping time. Another approach matches a set of
star schemata generated from both business requirements and data
sources [5]. Semantic similarity helps find the matched facts and
dimension names. Yet, the DW designer must intervene to manually
identify some elements.

A two-phase approach for automatic multidimensional schema
merging is achieved by transforming the multidimensional schema
into a UML class diagram [7]. Then, class names are compared and
the number of common attributes relative to the minimal number
of attributes of the two classes is computed to decide whether
two classes can be merged.

DW merging must operate at both schema and instance levels. 
Two DW merging approaches are the intersection and union of 
the matched dimensions. Instance merging is realized by a d-chase 
procedure [15]. The second merging strategy exploits similar di- 
mensions based on the equivalent levels in schema merging [11]. It 
also uses the d-chase algorithm for instance merging. However, the 
two approaches above do not consider the fact table. Another DW 
merging approach is based on the lexical similarity of schema string 
names and instances, and by considering schema data types and 
constraints [8]. Having the mapping correspondences, the merging 
algorithm takes the preservation requirements of the multidimen- 
sional elements into account, and is formulated to build the final 
consolidated DW. However, merging details are not precise enough 
and hierarchies are not considered. 

To summarize, none of the existing merging methods can satisfy
our DW merging requirements. Some multidimensional components
are ignored in these approaches, and the merging details of each
specific multidimensional component are not explicit enough,
which motivates us to propose a complete DW merging approach.


3 PRELIMINARIES 


We introduce in this section the basic concepts of multidimensional 
DW design [13]. The multidimensional DW can be modelled by a 
star or a constellation schema. In the star schema, there is a single 
fact connected with different dimensions, while the constellation 
schema consists of more than one fact which share one or several 
common dimensions. 


Definition 3.1. A constellation, denoted C, is defined as
$(N^C, F^C, D^C, Star^C)$ where $N^C$ is a constellation name,
$F^C = \{F_1^C, \dots, F_m^C\}$ is a set of facts,
$D^C = \{D_1^C, \dots, D_n^C\}$ is a set of dimensions, and
$Star^C : F^C \to 2^{D^C}$ associates each fact to its linked
dimensions. A star is a constellation where $F^C$ contains a
single fact, i.e., $m = 1$.


A dimension models an analysis axis and is composed of at- 
tributes (dimension properties). 


Definition 3.2. A dimension, denoted $D \in D^C$, is defined as
$(N^D, A^D, H^D, I^D)$ where $N^D$ is a dimension name,
$A^D = \{a_1^D, \dots, a_u^D\} \cup \{id^D\}$ is a set of
attributes, where $id^D$ represents the dimension identifier,
which is also the parameter of the lowest level and called the
root parameter. $H^D = \{H_1^D, \dots, H_v^D\}$ is a set of
hierarchies, and $I^D = \{i_1^D, \dots, i_p^D\}$ is a set of
dimension instances. The value of the instance $i_p^D$ for an
attribute $a_u^D$ is annotated as $i_p^D.a_u^D$.


Dimension attributes (also called parameters) are organised ac- 
cording to one or more hierarchies. Hierarchies represent a partic- 
ular vision (perspective) and each parameter represents one data 
granularity according to which measures could be analysed. 


Definition 3.3. A hierarchy of a dimension D, denoted
$H \in H^D$, is defined as $(N^H, Param^H)$ where $N^H$ is a
hierarchy name and $Param^H = \langle id^D, p_1^H, \dots, p_v^H \rangle$
is an ordered set of dimension attributes, called parameters,
which represent useful graduations along the dimensions,
$\forall k \in [1 \dots v], p_k^H \in A^D$. The roll up
relationship between two parameters can be denoted by
$p_1^H \preceq_H p_2^H$ for the case where $p_1^H$ rolls up to
$p_2^H$ in H. For $Param^H$, we have
$id^D \preceq_H p_1^H, p_1^H \preceq_H p_2^H, \dots, p_{v-1}^H \preceq_H p_v^H$.
The matching of multidimensional schemata is based on the
matching of parameters; the matching of two parameters
$p_i^{H_1}$ and $p_j^{H_2}$ of two hierarchies is denoted as
$p_i^{H_1} \simeq p_j^{H_2}$.


A sub-hierarchy is a continuous sub-part of a hierarchy, which
we call the parent hierarchy of the sub-hierarchy. This concept
is used in our algorithms, but it carries no analytical meaning
of its own. A sub-hierarchy has the same elements as a hierarchy,
but its lowest level is not considered as "id". All parameters of
a sub-hierarchy are contained in its parent hierarchy and have
the same partial orders as those in the parent hierarchy.
"Continuous" means that, in the parameter set of the parent
hierarchy, between the lowest and highest level parameters of the
sub-hierarchy, there is no parameter which is in the parent
hierarchy but not in the sub-hierarchy.


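Definition 3.4. A sub-hierarchy $SH$ of $H \in H^D$ is defined as $(N^{SH}, Param^{SH})$ where $N^{SH}$ is a sub-hierarchy name and $Param^{SH} = \langle p_1^{SH}, \ldots, p_v^{SH} \rangle$ is an ordered set of parameters, $\forall k \in [1 \ldots v],\ p_k^{SH} \in Param^H$. According to the relationship between a sub-hierarchy and its parent hierarchy, we have: (1) $\forall p_1^{SH}, p_2^{SH} \in Param^{SH},\ p_1^{SH} \preceq_{SH} p_2^{SH} \Rightarrow p_1^{SH}, p_2^{SH} \in Param^H \wedge p_1^{SH} \preceq_H p_2^{SH}$; (2) $\forall p_1^H, p_2^H, p_3^H \in Param^H,\ p_1^H \preceq_H p_2^H \wedge p_2^H \preceq_H p_3^H \wedge p_1^H, p_3^H \in Param^{SH} \Rightarrow p_2^H \in Param^{SH}$.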
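The continuity condition of a sub-hierarchy can be illustrated with a minimal Python sketch. It assumes that a hierarchy is represented simply as a list of parameter names ordered from the lowest to the highest level; the function name `is_sub_hierarchy` is hypothetical and not part of the paper.

```python
from typing import Sequence


def is_sub_hierarchy(sub: Sequence[str], parent: Sequence[str]) -> bool:
    """Return True if `sub` is a contiguous, order-preserving part of `parent`.

    Assumption: parameter names are unique within a hierarchy.
    """
    n = len(sub)
    if n == 0:
        return False
    # Slide a window of length n over the parent and compare it with `sub`.
    return any(
        list(parent[start:start + n]) == list(sub)
        for start in range(len(parent) - n + 1)
    )


# H2 of the running example: Code, City, Department, Country, Continent.
h2 = ["Code", "City", "Department", "Country", "Continent"]

print(is_sub_hierarchy(["Department", "Country", "Continent"], h2))
print(is_sub_hierarchy(["Code", "Department"], h2))  # City is skipped
```

The second call is rejected because City lies between Code and Department in the parent hierarchy but is missing from the candidate, which is exactly what condition (2) excludes.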


A fact reflects information that has to be analysed according 
to dimensions and is modelled through one or several indicators 
called measures. 


Definition 3.5. A fact, noted $F \in F^C$, is defined as $(N^F, M^F, I^F, IStar^F)$ where $N^F$ is a fact name, $M^F = \{m_1^F, \ldots, m_w^F\}$ is a set of measures and $I^F = \{i_1^F, \ldots, i_{|I^F|}^F\}$ is a set of fact instances. The value of a measure $m_k^F$ of the instance $i_l^F$ is denoted as $i_l^F.m_k^F$. $IStar^F : I^F \rightarrow D^F$ is a function where $D^F$ is the cartesian product over sets of dimension instances, which is defined as $D^F = \prod_{D_k \in Star^C(F)} I^{D_k}$. $IStar^F$ associates fact instances to their linked dimension instances.


We complete these definitions with a function $extend(H_1, H_2)$, which extends the parameters of the first (sub-)hierarchy $H_1$ with those of the other one ($H_2$).


4 AN AUTOMATIC APPROACH FOR DW 
MERGING 


As illustrated in Figure 1, merging two DWs involves matching steps and steps dedicated to the merging of dimensions and facts. The matching of parameters and measures is based on syntactic and semantic similarities [10][6] between attribute or measure names. Since matching has been intensively studied in the literature, we focus in this paper only on the merging steps of our process (green rectangle in Figure 1). Regarding the merging, we first define an algorithm for the merging of hierarchies, which decomposes two hierarchies into sub-hierarchy pairs and merges them to get the final hierarchy set. Then, we define a dimension merging algorithm which works at both the instance and schema levels and completes some empty values. Finally, we define a star merging algorithm, based on the dimension merging algorithm, which merges the dimensions and the facts at the schema and instance levels and corrects the hierarchies after the merging.


[Figure 1 diagram: hierarchy merging (parameter matching, record of the matched parameters, generation of the sub-hierarchy pairs, merging of the sub-hierarchy pairs, generation of the final hierarchy set); dimension merging (schema merging, instance merging and complement); star schema merging (fact merging, instance merging, correction of the hierarchies).]


Figure 1: Overview of the merging process 


4.1 Hierarchy merging 


In this section, we define the schema merging process of two hierarchies coming from two different dimensions. The first challenge is to preserve the partial orders of the parameters. The second one is to decide the partial orders between parameters coming from different original hierarchies. These challenges are addressed by the algorithm proposed below, which proceeds in 4 steps: record of the matched parameters, generation of the sub-hierarchy pairs, merging of the sub-hierarchy pairs and generation of the final hierarchy set.

Algorithm 1 MergeHierarchies($H_1$, $H_2$)
Output: A set of merged hierarchies $H'$ or two sets of merged hierarchies $H^{1'}$ and $H^{2'}$
1: $M, SH', H' \leftarrow \emptyset$; // $M$ is an ordered set of the couples of matched parameters with possibly the couple of the last parameters; for the $n$th parameter couple $M[n-1]$, $M[n-1][0]$ represents the parameter of $H_1$ in $M[n-1]$, while $M[n-1][1]$ represents the one of $H_2$.
2: $Param^{SH_1}, Param^{SH_2}, Param^{SH_1'}, Param^{SH_2'} \leftarrow \emptyset$;
3: for each $p_i^{H_1} \in Param^{H_1}$ do
4:   for each $p_j^{H_2} \in Param^{H_2}$ do
5:     if $p_i^{H_1} \simeq p_j^{H_2}$ then
6:       $M \leftarrow M + \langle p_i^{H_1}, p_j^{H_2} \rangle$;
7:     end if
8:   end for
9: end for
10: if $M = \emptyset$ then
11:   $H^{1'} \leftarrow \{H_1\}$; $H^{2'} \leftarrow \{H_2\}$;
12:   return $H^{1'}, H^{2'}$
13: else
14:   $m_l \leftarrow \langle Param^{H_1}[|Param^{H_1}|-1], Param^{H_2}[|Param^{H_2}|-1] \rangle$; // pair of the last parameters
15:   if $m_l \notin M$ then
16:     $M \leftarrow M + m_l$;
17:   end if
18:   for $i = 0$ to $|M|-2$ do
19:     $p_1^{SH_1} \leftarrow M[i][0]$; // first parameter of $SH_1$
20:     $p_v^{SH_1} \leftarrow M[i+1][0]$; // last parameter of $SH_1$
21:     $p_1^{SH_2} \leftarrow M[i][1]$; // first parameter of $SH_2$
22:     $p_v^{SH_2} \leftarrow M[i+1][1]$; // last parameter of $SH_2$
23:     if $Param^{SH_1} \subseteq Param^{SH_2}$ then
24:       $SH' \leftarrow \{SH_2\}$;
25:     else if $Param^{SH_2} \subseteq Param^{SH_1}$ then
26:       $SH' \leftarrow \{SH_1\}$;
27:     else if $FD_{SH_1,SH_2} \neq \emptyset$ then
28:       for each $Param' \in MergeParameters(FD_{SH_1,SH_2})$ do
29:         $Param^{SH_a} \leftarrow Param'$; $SH' \leftarrow SH' + SH_a$;
30:       end for
31:     else
32:       $SH' \leftarrow \{SH_1, SH_2\}$;
33:     end if
34:     $H' \leftarrow \{H_a.extend(SH_b) \mid (H_a \in H') \wedge (SH_b \in SH')\}$;
35:   end for
36: end if
37: if $id^{D_1} \simeq id^{D_2}$ then
38:   $H' \leftarrow H' \cup \{H_1, H_2\}$;
39:   return $H'$
40: else
41:   $p_1^{SH_1'} \leftarrow Param^{H_1}[0]$; // first parameter of $SH_1'$
42:   $p_v^{SH_1'} \leftarrow M[0][0]$; // last parameter of $SH_1'$
43:   $p_1^{SH_2'} \leftarrow Param^{H_2}[0]$; // first parameter of $SH_2'$
44:   $p_v^{SH_2'} \leftarrow M[0][1]$; // last parameter of $SH_2'$
45:   for each $H_c \in H'$ do
46:     $H^{1'} \leftarrow SH_1'.extend(H_c)$; $H^{2'} \leftarrow SH_2'.extend(H_c)$;
47:   end for
48:   $H^{1'} \leftarrow H^{1'} \cup \{H_1\}$; $H^{2'} \leftarrow H^{2'} \cup \{H_2\}$;
49:   return $H^{1'}, H^{2'}$
50: end if


4.1.1 Record of the matched parameters. The first step of the algorithm consists in matching the parameters of the two hierarchies and recording the matched parameter pairs (L1-L9). If there is no matched parameter between the two hierarchies, the merging process stops (L11-L12).


4.1.2 Generation of the sub-hierarchy pairs. Then the algorithm generates pairs of sub-hierarchies ($SH_1$ and $SH_2$) of the original hierarchies whose lowest and highest level parameters are adjacent in the list of matched parameter pairs created in the previous step (L18-L22). To make sure that the last parameters of the two hierarchies are included in the sub-hierarchies, we also add the pair of the last parameters into the list of matched parameter pairs (L14-L17).


Example 4.1. In Figure 2, for (a), we have $H_1.Code \simeq H_2.Code$, $H_1.Department \simeq H_2.Department$ and $H_1.Continent \simeq H_2.Continent$. For the first sub-hierarchy pair, the first parameter of $SH_1$ and $SH_2$ is Code and their last parameter is Department, so we have $Param^{SH_1} = \langle Code, Department \rangle$ and $Param^{SH_2} = \langle Code, City, Department \rangle$. In the second sub-hierarchy pair, we get the sub-hierarchy of $H_1$ from Department to Continent, $Param^{SH_1} = \langle Department, Region, Continent \rangle$, and the sub-hierarchy of $H_2$ from Department to Continent, $Param^{SH_2} = \langle Department, Country, Continent \rangle$. If the last parameters of the two original hierarchies do not match, like Continent of $H_1$ and Country of $H_3$ in (b), $\langle Continent, Country \rangle$ is added into the matched parameter pairs $M$ of the algorithm, so that the last sub-hierarchies of $H_1$ and $H_3$ are $Param^{SH_1} = \langle Department, Region, Continent \rangle$ and $Param^{SH_3} = \langle Department, Country \rangle$.
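The first two steps can be sketched in Python as follows. This is only an illustration under simplifying assumptions: hierarchies are lists of parameter names ordered from the lowest to the highest level, two parameters match when their names are equal (the paper relies on syntactic and semantic similarity instead), and the function names `record_matches` and `sub_hierarchy_pairs` as well as the full content of `h3` are hypothetical.

```python
def same_name(p, q):
    """Stand-in matcher: parameters match when their names are equal."""
    return p == q


def record_matches(h1, h2, match=same_name):
    """Ordered list of matched parameter pairs, plus the pair of last parameters."""
    m = [(p, q) for p in h1 for q in h2 if match(p, q)]
    if not m:
        return m  # no matched parameter: the merging process stops
    last = (h1[-1], h2[-1])
    if last not in m:
        m.append(last)
    return m


def sub_hierarchy_pairs(h1, h2, m):
    """Cut both hierarchies between adjacent matched pairs of `m`."""
    pairs = []
    for (low1, low2), (high1, high2) in zip(m, m[1:]):
        sh1 = h1[h1.index(low1): h1.index(high1) + 1]
        sh2 = h2[h2.index(low2): h2.index(high2) + 1]
        pairs.append((sh1, sh2))
    return pairs


h1 = ["Code", "Department", "Region", "Continent"]
h2 = ["Code", "City", "Department", "Country", "Continent"]
h3 = ["City", "Department", "Country"]  # assumed content, last parameter Country

for sh1, sh2 in sub_hierarchy_pairs(h1, h2, record_matches(h1, h2)):
    print(sh1, sh2)
for sh1, sh3 in sub_hierarchy_pairs(h1, h3, record_matches(h1, h3)):
    print(sh1, sh3)
```

For `h1` and `h2` the sketch yields the two sub-hierarchy pairs of case (a); for `h1` and `h3` the unmatched last parameters are paired, as in case (b).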


4.1.3 Merging of the sub-hierarchies. We then merge each sub-hierarchy pair to get a set of merged sub-hierarchies ($SH'$) and combine each of these sub-hierarchy sets to get a set of merged hierarchies ($H'$) (L23-L35).


Figure 2: Example of generation of the sub-hierarchy pairs

The matched parameters will be merged into one parameter, so 
it’s the unmatched parameters that we should deal with. We have 
2 cases in terms of the unmatched parameters. 

If one of the sub-hierarchies has no unmatched parameter, we 
obtain a sub-hierarchy set containing one sub-hierarchy whose 
parameter set is the same as the other sub-hierarchy (L23-L26). 


Example 4.2. Consider the first sub-hierarchy pair $SH_1 = \langle Code, Department \rangle$ and $SH_2 = \langle Code, City, Department \rangle$ of $H_1$ and $H_2$ in Figure 4. $SH_1$ does not have any unmatched parameter, so the obtained sub-hierarchy set contains one sub-hierarchy whose parameter set is the same as that of $SH_2$, which is $\langle Code, City, Department \rangle$.


The second case is that both sub-hierarchies have unmatched parameters (L27-L30). We then check whether these unmatched parameters can be merged into one or several hierarchies and discover their partial orders. Our solution is based on the functional dependencies (FDs) of these parameters. To be able to detect the FDs between the parameters of the two sub-hierarchies, the instances of the two sub-hierarchies must intersect, which means that they should share values on the root parameter of the sub-hierarchies. We keep only the FDs which have a single parameter on both sides and which cannot be inferred by transitivity. These FDs, represented in the form of ordered sets ($FD_{SH_1,SH_2}$), are then processed by algorithm 2 MergeParameters to get the parameter sets of the merged sub-hierarchies. If it is not possible to discover any FD, the two sub-hierarchies cannot be merged (L31-L32).
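As a rough illustration of how a single candidate FD between two parameters could be checked on instances, the sketch below tests whether every value of the left-hand parameter is associated with exactly one value of the right-hand parameter. It is not the detection procedure of the paper: the function name `holds_fd`, the dictionary-based row format and the choice to ignore rows with empty values are assumptions, and the filtering of FDs inferable by transitivity is not shown.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on `rows`.

    Rows where either value is missing (None) are ignored (assumption).
    """
    seen = {}
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        if left is None or right is None:
            continue
        # setdefault stores the first right-hand value seen for `left`.
        if seen.setdefault(left, right) != right:
            return False
    return True


rows = [
    {"Department": "D1", "Region": "R1"},
    {"Department": "D2", "Region": "R2"},
    {"Department": "D3", "Region": "R3"},
    {"Department": "D4", "Region": "R3"},
    {"Department": "D6", "Region": "R5"},
]

print(holds_fd(rows, "Department", "Region"))
print(holds_fd(rows, "Region", "Department"))
```

On these rows Department determines Region, while Region does not determine Department because R3 is associated with both D3 and D4, so only the ordered set ⟨Department, Region⟩ would be kept.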

Algorithm 2 MergeParameters recursively constructs the parameter sets from the FDs given in the form of ordered sets. In each recursion loop, for each one of these sets, we search for the other ones whose non-last (or non-first) elements have the same values and order as its non-first (or non-last) elements and then merge them (L6-L21). The recursion stops when no two sets can be merged any more (L22-L31).


Example 4.3. Suppose we have $FD = \langle \langle A,B \rangle, \langle B,C \rangle, \langle B,F \rangle, \langle C,E \rangle, \langle D,B \rangle \rangle$. As illustrated in Figure 3, in the first recursion, by merging the ordered sets, we get $Param = \langle \langle A,B,C \rangle, \langle A,B,F \rangle, \langle B,C,E \rangle, \langle D,B,C \rangle, \langle D,B,F \rangle \rangle$. All the ordered sets in $FD$ are merged, so there are only merged ordered sets in $Param$. $Param$ is then given as input to the second recursion, where we get $\langle \langle A,B,C,E \rangle, \langle D,B,C,E \rangle \rangle$ after the merging of the ordered sets. Since $\langle A,B,F \rangle$ and $\langle D,B,F \rangle$ are not merged, they are also added into $Param$, and we get $Param = \langle \langle A,B,C,E \rangle, \langle D,B,C,E \rangle, \langle A,B,F \rangle, \langle D,B,F \rangle \rangle$. In the final recursion, it is no longer possible to merge any two ordered sets, so the final set of parameter sets is $\langle \langle A,B,C,E \rangle, \langle D,B,C,E \rangle, \langle A,B,F \rangle, \langle D,B,F \rangle \rangle$.


[Figure 3 diagram: the input FDs and the ordered sets obtained after the first, second and third recursions of Example 4.3.]


Figure 3: Example of parameter merging based on FDs 


Algorithm 2 MergeParameters($FD$)
Output: A set of parameter sets $Param$
1: $l \leftarrow |FD|$;
2: for $n \leftarrow 0$ to $l-1$ do
3:   $fdmerged[n] \leftarrow False$; // Boolean indicating whether an element in $FD$ is merged
4: end for
5: $existmerged \leftarrow False$; // Boolean indicating whether there are elements that are merged in a recursion loop
6: for $i \leftarrow 0$ to $l-1$ do
7:   for $j \leftarrow i+1$ to $l$ do
8:     if $FD[i][1 : l-1] = FD[j][0 : l-2]$ then // $FD[a][b : c]$ represents the ordered set having the values and order from the $b$th element to the $c$th element of $FD[a]$
9:       $Param' \leftarrow FD[i][1 : l-1]$;
10:      $Param' \leftarrow Param' + FD[j][l-1]$;
11:      $Param \leftarrow Param + Param'$;
12:      $fdmerged[i], fdmerged[j], existmerged \leftarrow True$;
13:    end if
14:    if $FD[i][0 : l-2] = FD[j][1 : l-1]$ then
15:      $Param' \leftarrow FD[j][1 : l-1]$;
16:      $Param' \leftarrow Param' + FD[i][l-1]$;
17:      $Param \leftarrow Param + Param'$;
18:      $fdmerged[i], fdmerged[j], existmerged \leftarrow True$;
19:    end if
20:  end for
21: end for
22: if $existmerged = True$ then
23:   for $m \leftarrow 0$ to $l-1$ do
24:     if $fdmerged[m] = False$ then
25:       $Param \leftarrow Param + FD[m]$;
26:     end if
27:   end for
28:   $Param \leftarrow MergeParameters(Param)$;
29: else
30:   $Param \leftarrow FD$;
31: end if
32: return $Param$
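The recursion can be tried out with the following Python sketch. It is an interpretation rather than a transcription of Algorithm 2: slice bounds are taken relative to the length of each ordered set, the inner loop stops at the last element, a merged set keeps all elements of the first set followed by the last element of the second one, and duplicates are skipped. These choices are assumptions made so that the sketch reproduces Example 4.3; the name `merge_parameters` is hypothetical, and the input FDs are assumed to be acyclic (otherwise the recursion would not terminate).

```python
def merge_parameters(fds):
    """Chain ordered sets whose overlapping parts agree, until nothing merges."""
    fds = [tuple(fd) for fd in fds]
    merged = [False] * len(fds)
    result = []

    def add(chain):
        if chain not in result:  # skip duplicates (assumption)
            result.append(chain)

    for i in range(len(fds)):
        for j in range(i + 1, len(fds)):
            a, b = fds[i], fds[j]
            if a[1:] == b[:-1]:  # tail of a equals head of b: a then last of b
                add(a + b[-1:])
                merged[i] = merged[j] = True
            if b[1:] == a[:-1]:  # tail of b equals head of a: b then last of a
                add(b + a[-1:])
                merged[i] = merged[j] = True

    if not any(merged):
        return fds  # nothing left to merge: end of the recursion
    for fd, was_merged in zip(fds, merged):
        if not was_merged:
            add(fd)  # carry unmerged sets over to the next recursion
    return merge_parameters(result)


fd = [("A", "B"), ("B", "C"), ("B", "F"), ("C", "E"), ("D", "B")]
print(merge_parameters(fd))
```

With the FDs of Example 4.3 as input, the sketch returns the four ordered sets given at the end of that example, in the same order.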


After the merging of each sub-hierarchy pair, we extend the final 
merged hierarchy set by the new merging result (L34). 


4.1.4 Generation of the final hierarchy set. L37-L49 concern the generation of the final hierarchy set. The two original hierarchies may have different instances, so there may be empty values in the instances of the merged hierarchies. Some empty values can be completed, as described in the next section on dimension merging, but not all of them. The remaining empty values produce incomplete hierarchies and make the analysis difficult. Inspired by the concept of structural repair [1], we also add the two original hierarchies into the final hierarchy set. A parameter which appears in different hierarchies can then be divided into different parameters, one per hierarchy of the hierarchy set, so that each hierarchy is complete. Thus, for the multidimensional schema that we get, we provide an analysis form as shown in Figure 4. In the analysis form, one parameter can be marked with different numbers if it is in different hierarchies.

For the generation of the final hierarchy set, we distinguish 2 cases, which lead to 2 kinds of output (one or two sets of merged hierarchies): either the 2 hierarchies have matched root parameters, which means that their dimensions represent the same analysis axis, or they do not.

If the root parameters of the two original hierarchies match, we simply add the two original hierarchies into the merged hierarchy set obtained in the previous step to get one final merged hierarchy set (L37-L39).


Example 4.4. For the hierarchies $H_1$ and $H_2$ in Figure 4, we combine the merged hierarchy obtained in Example 4.4 with the result gained in Example 4.2 to get the merged hierarchy $\langle Code, City, Department, Region, Country, Continent \rangle$. We add this merged hierarchy into the hierarchy set $H'$ and then also add the original hierarchies $H_1$ and $H_2$. Thus $H'$ is the final merged hierarchy set.


[Figure 4 diagram: the original hierarchies, the merged hierarchy set $H'$ and its analysis form, in which a parameter shared by several hierarchies carries a different number in each of them (for example CONTINENT1, CONTINENT2, CONTINENT3).]


Figure 4: Hierarchy merging example 


If the root parameters of the two original hierarchies do not match, we get two merged hierarchy sets instead of one. For each original hierarchy, the final merged hierarchy set is obtained by extending the sub-hierarchy containing all the parameters which are not included in any of the sub-hierarchies created before ($SH_1'$ and $SH_2'$) with the merged hierarchy set that we get, plus this original hierarchy itself (L41-L49).


Example 4.5. In Figure 5, between $H_1$ and $H_3$, we have $H_1.Department \simeq H_3.Department$ and $H_1.Continent \simeq H_3.Continent$. We can then get one sub-hierarchy pair in which the 2 sub-hierarchies contain the parameter sets $\langle Department, Region, Continent \rangle$ and $\langle Department, Country, Continent \rangle$. By merging the sub-hierarchy pair, we get the merged hierarchy whose parameter set is $\langle Department, Region, Country, Continent \rangle$. For $H_1$, the remaining part $\langle Code \rangle$ is associated to it to get the merged hierarchy $H_{13}^{1'}$. We then get the merged hierarchy set of $H_1$ containing $H_1$ and $H_{13}^{1'}$. We do the same thing for $H_3$ and get the merged hierarchy set containing $H_3$ and $H_{13}^{2'}$.
4.2 Dimension merging


This section concerns the merging of two dimensions having matched 
attributes which is realized by algorithm 3 MergeDimensions. We 
consider both the schema and instance levels for the merging of 
dimensions. The schema merging is based on the merging of hier- 
archies. Concerning the instances, we have 2 tasks: merging the 
instances and completing the empty values. 


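Algorithm 3 MergeDimensions($D_1$, $D_2$)
Output: One merged dimension $D'$ or two merged dimensions $D^{1'}$ and $D^{2'}$
1: if $id^{D_1} \simeq id^{D_2}$ then
2:   $H^{D'} \leftarrow \emptyset$;
3:   for each $H_i^{D_1} \in H^{D_1}$ do
4:     for each $H_j^{D_2} \in H^{D_2}$ do
5:       $H^{D'} \leftarrow H^{D'} \cup MergeHierarchies(H_i^{D_1}, H_j^{D_2})$;
6:     end for
7:   end for
8:   $A^{D'} \leftarrow A^{D_1} \cup A^{D_2}$; $H^m \leftarrow H^{D'} \setminus (H^{D_1} \cup H^{D_2})$;
9:   CompleteEmpty($D'$, $D'$, $H^m$);
10:  return $D'$
11: else
12:  $H^{D^{1'}}, H^{D^{2'}}, A^{D^{1'}}, A^{D^{2'}} \leftarrow \emptyset$;
13:  for each $H_i^{D_1} \in H^{D_1}$ do
14:    for each $H_j^{D_2} \in H^{D_2}$ do
15:      $H^{1'}, H^{2'} \leftarrow MergeHierarchies(H_i^{D_1}, H_j^{D_2})$;
16:      $H^{D^{1'}} \leftarrow H^{D^{1'}} \cup H^{1'}$; $H^{D^{2'}} \leftarrow H^{D^{2'}} \cup H^{2'}$;
17:    end for
18:  end for
19:  for each $H_u^{D^{1'}} \in H^{D^{1'}}$ do
20:    $A^{D^{1'}} \leftarrow A^{D^{1'}} \cup Param^{H_u^{D^{1'}}}$;
21:  end for
22:  for each $H_v^{D^{2'}} \in H^{D^{2'}}$ do
23:    $A^{D^{2'}} \leftarrow A^{D^{2'}} \cup Param^{H_v^{D^{2'}}}$;
24:  end for
25:  $H^{m1} \leftarrow H^{D^{1'}} \setminus H^{D_1}$; $H^{m2} \leftarrow H^{D^{2'}} \setminus H^{D_2}$;
26:  CompleteEmpty($D^{1'}$, $D^{2'}$, $H^{m1}$);
27:  CompleteEmpty($D^{2'}$, $D^{1'}$, $H^{m2}$);
28:  return $D^{1'}, D^{2'}$
29: end if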


4.2.1 Schema merging. If the root parameters of the two dimensions match, the algorithm generates a merged dimension (L1-L8). The hierarchy set of the merged dimension is the union of the hierarchy sets generated by merging every 2 hierarchies of the original dimensions (L3-L7). We also get a hierarchy set containing only the merged hierarchies but no original hierarchies ($H^m$), which is used for the complement of the empty values (L8). The attribute set of the merged dimension is the union of the attribute sets of the original dimensions (L8).


Example 4.6. Given 2 original dimensions $D_1$ and $D_2$ in Figure 8 and their instances in Figure 6, we can get the merged dimension schema $D'$ in Figure 8. In $D'$, $H_1$ and $H_2$ are the original hierarchies of $D_1$, $H_3$ and $H_4$ are those of $D_2$, $H_{13}$ is a merged hierarchy of $H_1$ and $H_3$, and $H_{24}$ is a merged hierarchy of $H_2$ and $H_4$. We can thus get $H^{D'} = \{H_1, H_2, H_3, H_4, H_{13}, H_{24}\}$, $H^m = \{H_{13}, H_{24}\}$, $A^{D'} = \{Code, City, Department, Region, Country, Continent, Profession, Subcategory, Category\}$.


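[Figure 5 diagram: the original hierarchies and the merged hierarchies of the dimension merging example, with parameters shared by several hierarchies numbered per hierarchy (for example CONTINENT1, CONTINENT2).]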


Figure 5: Dimension merging example (schema) 


When the root parameters of the two dimensions do not match, we get a merged dimension for each original dimension, which is realized by L13-L25. For each original dimension, the hierarchy set of its corresponding merged dimension is the union of all hierarchy sets generated by merging every 2 hierarchies of the original dimensions (L13-L18), and the attribute set is the union of the attributes of each hierarchy in the merged dimension (L19-L24). Similar to the first case, we get a hierarchy set containing only the merged hierarchies for each original dimension ($H^{m1}$ and $H^{m2}$) (L25).


Example 4.7. Given 2 original dimensions $D_1$ and $D_2$ in Figure 5 and their instances in Figure 7, after the execution of algorithm 3 MergeDimensions, we can get the merged dimension schemata $D^{1'}$ and $D^{2'}$ in Figure 5. In $D^{1'}$, $H_1$ and $H_2$ are the original hierarchies of $D_1$, and $H_{13}^{1'}$ is the merged hierarchy of $H_1$ and $H_3$. In $D^{2'}$, $H_3$ is the original hierarchy of $D_2$, and $H_{13}^{2'}$ is the merged hierarchy of $H_1$ and $H_3$. So for $D^{1'}$, we have $H^{D^{1'}} = \{H_1, H_2, H_{13}^{1'}\}$, $H^{m1} = \{H_{13}^{1'}\}$, $A^{D^{1'}} = \{Code, Department, Region, Country, Continent, Profession, Category\}$, while for $D^{2'}$, we get $H^{D^{2'}} = \{H_3, H_{13}^{2'}\}$, $H^{m2} = \{H_{13}^{2'}\}$, $A^{D^{2'}} = \{City, Department, Region, Country, Continent\}$.


4.2.2 Instance merging and complement. When the root parameters of the two dimensions match, the instance of the merged dimension is obtained by the union of the two original dimension instances, which means that we insert the data of the two original dimension tables into the merged dimension table and merge the lines which have the same root parameter instance (L9).


Example 4.8. The instance merging result of Example 4.6 is presented in Figure 6. All the data in the original dimension tables $D_1$ and $D_2$ are integrated into the merged dimension table $D'$. The original tables of the instances are marked on the left of the merged table $D'$ with different colors. Some instances come from both $D_1$ and $D_2$, which means that they have the same root parameter value in $D_1$ and $D_2$ and are therefore merged together.
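The instance merging of two dimensions with matched root parameters can be sketched in Python as follows. The sketch assumes that dimension tables are lists of dictionaries, that empty values are represented by `None`, and that, when both tables provide a value for the same attribute of the same root value, the value read last is kept; the function name `merge_instances` is hypothetical, and the rows are a small extract in the spirit of Figure 6.

```python
def merge_instances(rows1, rows2, root, attributes):
    """Union of two dimension tables, merging lines with the same root value."""
    merged = {}
    for row in list(rows1) + list(rows2):
        # One merged line per root value, initialised with empty values.
        target = merged.setdefault(row[root], dict.fromkeys(attributes))
        for attribute, value in row.items():
            if value is not None:
                target[attribute] = value
    return list(merged.values())


d1 = [
    {"Code": "C1", "Department": "D1", "Region": "R1"},
    {"Code": "C3", "Department": "D1", "Region": "R1"},
]
d2 = [
    {"Code": "C1", "City": "CT1", "Department": "D1", "Country": "CTR1"},
    {"Code": "C8", "City": "CT3", "Department": "D7", "Country": "CTR3"},
]
attributes = ["Code", "City", "Department", "Region", "Country"]

for line in merge_instances(d1, d2, "Code", attributes):
    print(line)
```

C1 appears in both tables and ends up as a single line carrying the attributes of both, whereas C3 and C8 keep empty values on the attributes that exist only in the other table.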


The attribute set of the merged dimension contains all the attributes of the two original dimensions, while each original dimension may have attributes of its own. So there may be empty values in the merged dimension table for the instances coming from only one of the original dimension tables, and we should complete these empty values on the basis of the existing data (L9).

The complement of the empty values is realized by algorithm 4 CompleteEmpty, where the input $D^{1'}$ is the merged dimension table having empty values to be completed, $D^{2'}$ is the merged dimension table which provides the completed values and $H^m$ is the hierarchy


CODE DEPARTMENT REGION CONTINENT PROFESSION CATEGORY CODE CITY DEPARTMENT COUNTRY CONTINENT PROFESSION SUBCATEGORY


(ci D1 R1 CTN1 P1 cG1 (ca ct1 | D1 CTR1 cTN1 P1 sci 
ic2} D2 R2 CTN1 P1 G1 C2 | ct2 | p2 cTR1 cTN1 P1 sci 
c3 D1 R1 CTN1 P2 G1 ica cT4 | D3 CTR2 cTN1 P3 Sc2 
ica} D3 R3 CTN1 P3 cG2 1C6 + cté | D4 cTR2 cTN1 P5 sca 
cs D6 R5 CTN2 Pa cG2 1C71 cTé | D4 CTR2 cTN1 P6 sca 
(ce: D4 R3 CTN1 PS cG1 ca cT3_ | D7 CTR3 CTN2 Ps sca 
hc7 D4 R3 CTNA P6 cG1 co cts | D4 cTR2 cTN41 P7 sc2 
D, a | D, 
CODE CITY DEPARTMENT REGION COUNTRY CONTINENT PROFESSION SUBCATEGORY CATEGORY 
ci {cma ./p1 R1 CTR1 CTN1 P1 sci cG1 
Pee Tie ct2 | D2 R2 cTR1 cTN1 P1 sci cG1 
D, +4¢c3~ | NULL!|D1 R1 NULUSETRI || CTN1 P2 NULL cG1 
DD, +j}c4 | cT4 | D3 R3 CTR2 CTN1 P3 cp se2 ce2 
D, +45 | NULL | D6 R5 NULL CTN2 P4 ) | NULL ce2 
on {| cé |cté |p4 R3 CTR2 CTN1 P5 sca cea 
c7|cTé ,|/p4 R3 CTR2 cTN1 P6 | yy sca cG1 
& {5 cT3 ; D7 NULL CTR3 cTN2 Ps i sca NULL>CG1 
co {cts ‘{p4 NULL>R3 | CTR2 cTN1 P7 “-Fsc2 NULL->CG2 


D’ 


Figure 6: Dimension merging example (instance) 


set of $D^{1'}$ containing only merged hierarchies but no original hierarchies. In the case discussed here, $D'$ is given as both $D^{1'}$ and $D^{2'}$ in CompleteEmpty, since we get one merged dimension including all data of the two original dimensions (L9).


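Algorithm 4 CompleteEmpty($D^{1'}$, $D^{2'}$, $H^m$)
1: for each $H_a \in H^m$ do
2:   $I^n \leftarrow \emptyset$;
3:   $I^n \leftarrow I^n \cup \{i^{D^{1'}} \in I^{D^{1'}} \mid (i^{D^{1'}}.p_1^{H_a}$ is not null$) \wedge (\exists p_k^{H_a} \in Param^{H_a},\ i^{D^{1'}}.p_k^{H_a}$ is null$)\}$;
4:   for each $i^n \in I^n$ do
5:     $P^n \leftarrow \{p_k^{H_a} \in Param^{H_a} \mid i^n.p_k^{H_a}$ is null$\}$;
6:     $P^r \leftarrow \{p_k^{H_a} \in Param^{H_a} \mid (i^n.p_k^{H_a}$ is not null$) \wedge (\forall p^n \in P^n,\ p_k^{H_a} \preceq_{H_a} p^n)\}$;
7:     if $\exists i^{D^{2'}} \in I^{D^{2'}} \wedge p^r \in P^r,\ (i^{D^{2'}}.p^r = i^n.p^r) \wedge (\forall p^n \in P^n,\ i^{D^{2'}}.p^n$ is not null$)$ then
8:       for each $p^n \in P^n$ do
9:         $i^n.p^n \leftarrow i^{D^{2'}}.p^n$;
10:      end for
11:    end if
12:  end for
13: end for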


For an empty value, we search for an instance which has the same value as the instance of this empty value on one of the parameters rolling up to the parameter of the empty value, and whose value on the parameter of the empty value is not empty; we can then fill the empty value with this non-empty value. Completing empty values may also change the hierarchies to which an instance belongs. Nevertheless, after completing the empty values of an instance, some completed parameters may not be included in the hierarchies of the instance, in which case the complement of such values does not make sense. The possible change of hierarchy is from hierarchies containing fewer parameters to those containing more parameters. Since the merged hierarchies contain more parameters than their corresponding original hierarchies, before completing an instance we first look at the merged hierarchies to decide which parameter values can be completed.

In algorithm 4 CompleteEmpty, which aims to complete the empty values, for each hierarchy in the merged hierarchy set we look for the instances of the merged dimension table (a) which contain empty values on the parameters of this hierarchy (L3) and (b) whose value of the second lowest parameter is not empty (L3). Condition (a) is basic because we need empty values to be completed. Since we complete the empty values from the other lines of the merged dimension table, we can only rely on the non-id parameters, because the id is unique; if the second lowest parameter is empty, it can never be completed, and neither can the hierarchy. That is why we have condition (b). For each one of the instances satisfying these conditions ($I^n$), we search for the parameters ($P^n$) having empty values (L5) and, to make sure that each one of them can be completed, we also search for the parameters ($P^r$) which roll up to the lowest of them and to which we refer to complete the empty values (L6). We can then complete the empty values as discussed in the previous paragraph (L7-L11).


Example 4.9. After the merging in Example 4.8, we get the empty values of $D'$, which are in red in Figure 6. The merged hierarchies are $H_{13}$ and $H_{24}$, as illustrated in Figure 5. For $H_{13}$, the instances of code C3 and C5 have empty values on the second lowest parameter City, so they do not satisfy condition (b). As we can see, for the instance of C3, although the value of Country can be retrieved through the value of Department, which is the same as that of the instance of C1, the value of City cannot be completed and thus we give up this complement. For the instance of C9, the value of Region is completed from C7, which has the same value of Department and whose value of Region is not empty. When it is the turn of $H_{24}$, the values of Category of C8 and C9 are completed in the same way.
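The complement of empty values can be illustrated by the Python sketch below. It is a simplified reading of algorithm 4 under several assumptions: a hierarchy is a list of parameter names ordered from the root to the highest level, tables are lists of dictionaries with `None` for empty values, the root parameter is never used as a reference since it is unique, and the first suitable line found provides the values. The name `complete_empty` and the rows (an extract in the spirit of Figure 6) are illustrative.

```python
def complete_empty(target_rows, source_rows, merged_hierarchies):
    """Fill empty values of `target_rows` in place from lines of `source_rows`."""
    for hierarchy in merged_hierarchies:
        for row in target_rows:
            empty = [p for p in hierarchy if row.get(p) is None]
            # Condition (a): something to complete; (b): second lowest level set.
            if not empty or row.get(hierarchy[1]) is None:
                continue
            lowest_empty = min(hierarchy.index(p) for p in empty)
            # Reference parameters: non-root levels below every empty parameter.
            refs = [p for p in hierarchy[1:lowest_empty] if row.get(p) is not None]
            for donor in source_rows:
                if donor is row:
                    continue
                shares_ref = any(donor.get(p) == row[p] for p in refs)
                has_values = all(donor.get(p) is not None for p in empty)
                if shares_ref and has_values:
                    for p in empty:
                        row[p] = donor[p]
                    break
    return target_rows


h13 = ["Code", "City", "Department", "Region", "Country", "Continent"]
table = [
    {"Code": "C1", "City": "CT1", "Department": "D1", "Region": "R1", "Country": "CTR1", "Continent": "CTN1"},
    {"Code": "C3", "City": None, "Department": "D1", "Region": "R1", "Country": None, "Continent": "CTN1"},
    {"Code": "C7", "City": "CT6", "Department": "D4", "Region": "R3", "Country": "CTR2", "Continent": "CTN1"},
    {"Code": "C9", "City": "CT5", "Department": "D4", "Region": None, "Country": "CTR2", "Continent": "CTN1"},
]

complete_empty(table, table, [h13])
print(table[3]["Region"], table[1]["Country"])
```

As in Example 4.9, the Region of C9 is taken from C7 through the shared Department, while C3 is left untouched because its City is empty.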


When the root parameters of the two dimensions don’t match, 
the instance merging and complement are done by L26-L27. The 
values of the attributes of one of the dimension tables coming from 
the other dimension table are empty, so there is only instance com- 
plement but no merging. We also call algorithm 4 CompleteEmpty 
to complete the instances for each one of the merged dimension 
tables. 


Example 4.10. The instance merging and complement for Example 4.7 are demonstrated in Figure 7. For $D^{1'}$, Country comes from the dimension table $D_2$, so the values of Country are completed by the values in $D^{2'}$. The same operation is also done for Region of $D^{2'}$.


D1
CODE  DEPARTMENT  REGION  CONTINENT  PROFESSION  CATEGORY
C1    D1          R1      CTN1       P1          CG1
C2    D2          R2      CTN1       P1          CG1
C3    D1          R1      CTN1       P2          CG1
C4    D3          R3      CTN1       P3          CG2
C5    D6          R5      CTN2       P4          CG2
C6    D4          R3      CTN1       P5          CG1
C7    D4          R3      CTN1       P6          CG2

D2
CITY  DEPARTMENT  COUNTRY  CONTINENT
CT1   D1          CTR1     CTN1
CT2   D2          CTR1     CTN1
CT3   D7          CTR3     CTN2
CT4   D3          CTR2     CTN1
CT5   D4          CTR2     CTN1
CT6   D4          CTR2     CTN1

D1'
CODE  DEPARTMENT  REGION  COUNTRY       CONTINENT  PROFESSION  CATEGORY
C1    D1          R1      NULL -> CTR1  CTN1       P1          CG1
C2    D2          R2      NULL -> CTR1  CTN1       P1          CG1
C3    D1          R1      NULL -> CTR1  CTN1       P2          CG1
C4    D3          R3      NULL -> CTR2  CTN1       P3          CG2
C5    D6          R5      NULL          CTN2       P4          CG2
C6    D4          R3      NULL -> CTR2  CTN1       P5          CG1
C7    D4          R3      NULL -> CTR2  CTN1       P6          CG2

D2'
CITY  DEPARTMENT  REGION      COUNTRY  CONTINENT
CT1   D1          NULL -> R1  CTR1     CTN1
CT2   D2          NULL -> R2  CTR1     CTN1
CT3   D7          NULL        CTR3     CTN2
CT4   D3          NULL -> R3  CTR2     CTN1
CT5   D4          NULL -> R3  CTR2     CTN1
CT6   D4          NULL -> R3  CTR2     CTN1


Figure 7: Dimension merging example (instance) 
4.3 Star merging


In this section, we discuss the merging of two stars. From two stars, we can get either a star schema or a constellation schema, because the fact tables of the two schemata may or may not be merged into one. The star merging relies on the dimension merging and the fact merging. Two stars can be merged only if they have dimensions with matched root parameters.


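Algorithm 5 MergeAllDimensions($S_1$, $S_2$)
Output: A set of merged dimensions $D^{S'}$
1: for each $D_i^{S_1} \in D^{S_1}$ do
2:   for each $D_j^{S_2} \in D^{S_2}$ do
3:     if $id^{D_i^{S_1}} \not\simeq id^{D_j^{S_2}}$ then
4:       $D_i^{S_1}, D_j^{S_2} \leftarrow MergeDimensions(D_i^{S_1}, D_j^{S_2})$;
5:     end if
6:   end for
7: end for
8: $D^{S'} \leftarrow \emptyset$;
9: for each $D_u^{S_1} \in D^{S_1}$ do
10:  for each $D_v^{S_2} \in D^{S_2}$ do
11:    if $id^{D_u^{S_1}} \simeq id^{D_v^{S_2}}$ then
12:      $D^{S'} \leftarrow D^{S'} \cup MergeDimensions(D_u^{S_1}, D_v^{S_2})$;
13:    end if
14:  end for
15: end for
16: for each $D_k^{S'} \in D^{S'}$ do
17:   for each $H_a^{D_k^{S'}} \in H^{D_k^{S'}}$ do
18:     if $\nexists\, i^{D_k^{S'}} \in I^{D_k^{S'}},\ (i^{D_k^{S'}}$ is on $H_a^{D_k^{S'}}) \vee (i^{D_k^{S'}}$ is only on $H_a^{D_k^{S'}} \wedge (H_a^{D_k^{S'}} \in H^{D^{S_1}} \vee H_a^{D_k^{S'}} \in H^{D^{S_2}}))$ then
19:       $H^{D_k^{S'}} \leftarrow H^{D_k^{S'}} - H_a^{D_k^{S'}}$;
20:     end if
21:   end for
22: end for
23: return $D^{S'}$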


For the dimensions of the two stars, we have two cases: 1. The two 
stars have the same number of dimensions and for each dimension 
of one schema, there is a dimension having matched root parameters 
in the other schema. 2. There exists at least one dimension between 
the two stars which does not have a dimension having a matched 
root parameter in the other. 

The dimension merging of two stars is common to the two cases and is done by algorithm 5 MergeAllDimensions. We first merge every two dimensions of the two stars which have unmatched root parameters, because the merging of such dimensions can complete the original dimensions with complementary attributes (L1-L7). Then the dimensions having matched root parameters are merged to generate the merged dimensions of the merged multidimensional schema (L8-L15). After the merging and complement of the instances of the dimension tables, there may be some merged hierarchies to which none of the instances belong. In this case, if there will be no more update of the data, such hierarchies should be deleted. There may also be original hierarchies in the merged dimensions such that no instance belongs to them without also belonging to a merged hierarchy containing all the parameters of this original hierarchy. The instances belonging to this kind of hierarchy also belong to other hierarchies which contain more parameters, so these original hierarchies become useless and should also be deleted (L18-L19).


Example 4.11. Consider the merging of the dimensions of the two stars $S_1$ and $S_2$ in Figure 8. The dimension Product of $S_1$ and the dimension Customer of $S_2$ are merged first, since their root parameters do not match but they have other matched parameters. Attributes of the dimension Customer of $S_2$ are thereby added into the dimension Product of $S_1$. The two dimensions Customer and the two dimensions Product have matched root parameters, so they are merged into the final star schema. After the merging and complement of the instances, we verify each hierarchy in the merged dimension tables. Suppose that the merging of $S_1$.Customer and $S_2$.Customer is as shown in Figure 5 at the schema level and in Figure 6 at the instance level. In their merged dimension table $D'$, all the instances belonging to $H_4$ also belong to $H_{24}$, which is a merged hierarchy containing all the parameters of $H_4$, so $H_4$ should be deleted.


We then discuss the merging of the other elements in the two 
cases which is processed by algorithm 6 MergeStar: 


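Algorithm 6 MergeStar($S_1$, $S_2$)
Output: A merged multidimensional schema which may be a star schema $S'$ or a merged constellation schema $C'$
1: if $(|D^{S_1}| = |D^{S_2}|) \wedge (\forall D_i^{S_1} \in D^{S_1},\ \exists D_j^{S_2} \in D^{S_2},\ id^{D_i^{S_1}} \simeq id^{D_j^{S_2}})$ then
2:   $D^{S'} \leftarrow MergeAllDimensions(S_1, S_2)$;
3:   $M^{F^{S'}} \leftarrow M^{F^{S_1}} \cup M^{F^{S_2}}$; $I^{F^{S'}} \leftarrow I^{F^{S_1}} \cup I^{F^{S_2}}$;
4:   $IStar^{F^{S'}} \leftarrow IStar^{F^{S_1}} \cup IStar^{F^{S_2}}$;
5:   return $S'$
6: else
7:   $D^{C'} \leftarrow MergeAllDimensions(S_1, S_2)$; $F^{C'} \leftarrow \{F^{S_1}, F^{S_2}\}$;
8:   return $C'$
9: end if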

For the first case, we merge the two fact tables into one fact table and get a star schema. The measure set of the merged star schema is the union of the 2 original measure sets (L3). The fact instances are the union of the fact instances of the two input star schemata (L3). The function associating fact instances to their linked dimension instances in the merged schema is also the union of the functions of the original schemata (L4).
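The merging of two fact tables in this first case can be sketched in Python as follows. The sketch assumes that fact tables are lists of dictionaries, that the dimension keys are the same in both tables, and that fact instances linked to the same dimension instances are combined into one line whose missing measures stay empty (`None`). The function name `merge_facts` is hypothetical and the rows are illustrative values, not a transcription of Figure 9.

```python
def merge_facts(facts1, facts2, keys, measures):
    """Union of two fact tables; lines with the same dimension keys are merged."""
    merged = {}
    for row in list(facts1) + list(facts2):
        key = tuple(row[k] for k in keys)
        if key not in merged:
            # New merged line: dimension keys set, all measures empty.
            merged[key] = {**dict.fromkeys(measures), **dict(zip(keys, key))}
        for measure in measures:
            if row.get(measure) is not None:
                merged[key][measure] = row[measure]
    return list(merged.values())


f_s1 = [
    {"Code": "C1", "Number": "P1", "Price": 2, "Quantity": 15},
    {"Code": "C3", "Number": "P5", "Price": 5, "Quantity": 20},
]
f_s2 = [
    {"Code": "C1", "Number": "P1", "Price": 2, "TaxRate": 0.1},
    {"Code": "C6", "Number": "P5", "TaxRate": 0.2},
]

merged_facts = merge_facts(
    f_s1, f_s2, keys=["Code", "Number"], measures=["Price", "Quantity", "TaxRate"]
)
for line in merged_facts:
    print(line)
```

The instance linked to (C1, P1) exists in both fact tables and receives the measures of both, while the other instances are kept with empty values on the measures they do not have.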


[Figure 8 diagram: the star schemata of the example with the hierarchies of the dimensions CUSTOMER and PRODUCT (parameters such as DEPARTMENT, REGION, COUNTRY, CONTINENT, CITY, SUBCATEGORY, CATEGORY and BRAND).]


Figure 8: Star merging example (schema) 
[Figure 9 tables: the fact table of $S_1$ (CODE, NUMBER, PRICE, QUANTITY), the fact table of $S_2$ (CODE, NUMBER, PRICE, TAXRATE) and the merged fact table $F^{S'}$ (CODE, NUMBER, PRICE, QUANTITY, TAXRATE), in which the measures missing from one of the original fact tables are NULL.]


Figure 9: Star merging example (instance) 


Example 4.12. For the two original star schemata in Figure 8,
dimension merging is discussed above, so we focus here on
the merging of fact table instances. The dimensions Customer and
Product of $S_1$ have matched root parameters in the
dimensions Customer and Product of $S_2$, respectively, and the two
schemata have the same number of dimensions. We therefore obtain a
merged star schema $S'$ whose fact table is built by merging the measures of
$S_1$ and $S_2$. At the instance level, Figure 9 shows the instances of the
fact tables $F^{S_1}$ and $F^{S_2}$. The framed parts are the instances that
share linked dimension instances, so they are merged in the merged fact table
$F^{S'}$. The other instances are also integrated into $F^{S'}$, with empty
values for the measures they lack; since these have little impact
on the analysis, they receive no special treatment.
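The instance-level behaviour described in this example resembles a full outer merge on the dimension keys. The sketch below illustrates this under assumptions: the function name, the dictionary layout, and the sample rows are hypothetical (they are not the values of Figure 9), and shared measures are assumed to agree on matching keys.

```python
def merge_fact_instances(f1, f2):
    """Full outer merge of two fact tables keyed by their dimension keys.

    f1, f2: dict mapping (customer_code, product_number) -> dict of measures.
    Measures missing on one side are filled with None (an empty value).
    """
    measures = sorted({m for table in (f1, f2) for row in table.values() for m in row})
    merged = {}
    for key in sorted(set(f1) | set(f2)):
        row = {m: None for m in measures}
        row.update(f1.get(key, {}))
        row.update(f2.get(key, {}))   # shared measures are assumed to agree
        merged[key] = row
    return merged


# Hypothetical fact instances of the two input schemata.
fact1 = {("C1", "P1"): {"Price": 2, "Quantity": 15},
         ("C3", "P5"): {"Price": 5, "Quantity": 20}}
fact2 = {("C1", "P1"): {"Price": 2, "TaxRate": 0.1},
         ("C6", "P5"): {"Price": 7, "TaxRate": 0.2}}

merged_fact = merge_fact_instances(fact1, fact2)
```

Only the row keyed by ("C1", "P1") is present in both inputs, so it is the only one with all three measures filled in.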


For the second case, since there are unmatched dimensions, the
merged schema is a constellation schema. The facts of the
original schemata are unchanged at both the schema and instance
levels and together compose the final constellation (L7).


Example 4.13. This example is simplified in Figure 10 because of
space limits. The original star schemata $S_1$ and $S_2$ both have a
dimension Customer, and these two dimensions have matched root parameters.
Each schema also has a dimension of its own: Time in $S_1$ and Product
in $S_2$. The merged schema is therefore a constellation schema generated
by merging the Customer dimensions and keeping the other
dimensions and the fact tables. At the instance level, we only obtain a
new merged dimension table for Customer; the other dimension and
fact tables remain unchanged.


Figure 10: Star merging example (schema)


5 EXPERIMENTAL ASSESSMENTS 


To validate the effectiveness of our approach, we applied our algorithms
to benchmark data. We did not find a benchmark suited
to our problem, so we adapted the datasets
of the TPC-H benchmark to generate different DWs. TPC-H was
originally designed for benchmarking decision support
systems by examining the execution of queries on large volumes of
data. Because of space limits, the test results are available on GitHub¹.


¹ https://github.com/Implementation111/Multidimensional-DW-merging


IDEAS 2021: the 25th anniversary 


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 


5.1 Technical environment and Datasets 


The algorithms were implemented in Python 3.7 and executed
on an Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz with
16 GB of RAM. The data are stored in R-OLAP format in the
Oracle 11g DBMS. The TPC-H benchmark provides a pre-defined
relational schema² with 8 tables and a generator of massive data.

First, we generated 100 MB of data files, containing
600572, 15000, 25, 150000, 20000, 80000, 5, and 1000 tuples in the tables
Lineitem, Customer, Nation, Orders, Part, Partsupp, Region, and
Supplier, respectively. Second, to obtain deeper hierarchies, we included the
data of Nation and Region in Customer and Supplier, and those
of Partsupp in Part. Third, we transformed these files to generate
two use cases by creating 2 DWs for each case. To make sure that
there are both common and different instances in the different DWs,
for each dimension, instead of selecting all the corresponding data,
we randomly selected 3/4 of them. For the fact table, we selected
the measures related to these dimension data. Since the methods in
the related work do not treat exactly the same components
or pursue the same objective as ours, we have no comparable baseline in our
experiments.
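The selection step can be pictured with the following sketch. It is not the script used for the experiments: the function names, the seeds, and the toy key ranges are assumptions introduced only to show how sampling 3/4 of each dimension yields DWs with both common and distinct instances.

```python
import random


def sample_keys(keys, fraction=0.75, seed=None):
    """Randomly keep a fraction of the dimension keys."""
    rng = random.Random(seed)
    ordered = sorted(keys)
    return set(rng.sample(ordered, int(len(ordered) * fraction)))


def select_facts(facts, sampled):
    """Keep the fact rows whose foreign keys all belong to the sampled dimensions."""
    return [row for row in facts
            if all(row[dim] in keys for dim, keys in sampled.items())]


# Toy data standing in for the TPC-H tables.
customers = range(1, 101)
parts = range(1, 41)
facts = [{"customer": c, "part": p, "quantity": 1}
         for c in customers for p in parts if (c + p) % 7 == 0]

dw1 = {"customer": sample_keys(customers, seed=1), "part": sample_keys(parts, seed=2)}
dw2 = {"customer": sample_keys(customers, seed=3), "part": sample_keys(parts, seed=4)}
facts1 = select_facts(facts, dw1)
facts2 = select_facts(facts, dw2)
```

Because each DW keeps 75 of the 100 customer keys, the two samples necessarily share at least 50 of them, which guarantees common instances to merge.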


5.2 Star schema generation 


Figure 11: Star schema generation


The objective of this experiment is to merge two star schemata
that have the same 4 dimensions, with matched lowest levels of
granularity for each dimension.

After executing our algorithms, we obtain one star schema, as
shown in Figure 11, which is consistent with the expectations. The
parameters of the hierarchies satisfy the functional dependency
relationships. The run time is 30.70s. The 3 dimensions Supplier,
Part, and Date of the original DWs are merged. Between the different
dimensions $S_1$.Supplier and $S_2$.Customer there is a matched
attribute, Nation, so they are also merged in the sense that $S_1$.Supplier
provides $S_2$.Customer with the attribute Region. The Customer dimension in
the merged DW therefore also has the attribute Region. We can also observe
that the merged schema would normally contain the original
hierarchy Orderdate → Month → Year of $S_2$.Date, but it is
deleted. Looking up the table, we find that no tuple
belongs to this hierarchy without also belonging to Orderdate → Month →
Semester → Year, which is why it is removed.

² http://tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf


         | Customer | Supplier | Part | Orderdate | Lineorder
$N_1$    |          |          |      | 1804      | 252689
$N_2$    |          |          |      | 1804      | 252821
$N_\cap$ |          |          |      | 1349      | 105345
$N'$     |          |          |      | 2259      | 400165

Table 1: Number of tuples


         | Customer.Region | Supplier.Region | Orderdate.Semester
$N_1$    | x               | x               | 1804
$N_2$    | x               | 750             | x
$N'$     | 9713            | 846             | 2259
$N_+$    | 9713            | 96              | 455

Table 2: Number of attributes


At the instance level, the result is available on GitHub. Table 1 shows
the number of tuples of the original DWs ($N_1$, $N_2$), of the merged
DW ($N'$), and the number of common tuples ($N_\cap$), i.e., tuples having
the same dimension key in the original DWs. For each dimension
or fact table, $N' = N_1 + N_2 - N_\cap$, which confirms that there
is no addition or loss of data. For each tuple in the original tables,
we verify that all the values are the same as the values in
the merged table. We also find that some empty values
of the attribute Region in the dimensions Customer and Supplier
and of the attribute Semester in the dimension Orderdate are
completed. Table 2 shows the number of values of these attributes in the
original DWs ($N_1$, $N_2$) and in the merged DW ($N'$), from which we obtain
the number of completed values $N_+$ for these attributes. They
satisfy the relationship $N' = N_1 + N_2 + N_+$.
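Both relationships can be checked directly on the counts reported in Tables 1 and 2. The short sketch below does so, assuming that the rows of Table 1 are ordered as original DW 1, original DW 2, common tuples, merged DW, and that an attribute marked "x" in Table 2 contributes 0 values; the variable and function names are ours.

```python
# Table 1 counts as (N1, N2, N_common, N_merged).
table1 = {
    "Orderdate": (1804, 1804, 1349, 2259),
    "Lineorder": (252689, 252821, 105345, 400165),
}

# Table 2 counts as (N1, N2, N_completed, N_merged); "x" is counted as 0.
table2 = {
    "Customer.Region": (0, 0, 9713, 9713),
    "Supplier.Region": (0, 750, 96, 846),
    "Orderdate.Semester": (1804, 0, 455, 2259),
}


def no_addition_or_loss(n1, n2, n_common, n_merged):
    """N' = N1 + N2 - N_common: common tuples are counted once."""
    return n_merged == n1 + n2 - n_common


def completion_consistent(n1, n2, n_completed, n_merged):
    """N' = N1 + N2 + N_completed: completed values add to the originals."""
    return n_merged == n1 + n2 + n_completed


tuples_ok = all(no_addition_or_loss(*row) for row in table1.values())
values_ok = all(completion_consistent(*row) for row in table2.values())
```

For instance, for Lineorder, 252689 + 252821 - 105345 = 400165, which matches the merged count.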


5.3 Constellation schema generation 


The objective of this experiment is to merge two star schemata
having the same 2 dimensions (Customer, Supplier) with the same
lowest level of granularity for each dimension, as well as 2 different
dimensions ($S_1$.Part and $S_2$.Date).


Figure 12: Constellation schema generation


At the schema level, the second test generates a constellation
schema, as shown in Figure 12. The run time is 32.13s. As expected,
the 2 dimensions Customer and Supplier of the original DWs
are merged, while the other dimensions and the fact tables are not. The
dimension Customer gains a new attribute, Region, through the merging
between $S_1$.Supplier and $S_2$.Customer. The hierarchy
Custkey → Nation of Customer, which would otherwise be in the merged
schema, is deleted because no tuple belongs to this
hierarchy without also belonging to Custkey → Nation → Region. The hierarchy
Suppkey → Nation of Supplier is removed for the same reason.

At the instance level, the data of the experiment can be found on
GitHub. They also satisfy $N' = N_1 + N_2 - N_\cap$. Some empty values
of the attribute Region in the dimensions Customer and Supplier
are completed, and their counts satisfy $N' = N_1 + N_2 + N_+$.

The results of the tests conform to our expectations;
we can thus conclude that our algorithms work well for the different
cases discussed, at both the schema and instance levels.


6 CONCLUSION AND FUTURE WORK 


In this paper, we define an automatic approach to merge two dif- 
ferent star schema-modeled DWs, by merging multidimensional 
schema elements including hierarchies, dimensions and facts at 
the schema and instance levels. We define the corresponding algo- 
rithms, which consider different cases. Our algorithms are imple- 
mented and illustrated by various examples. 

Since we only discuss the merging of DWs modeled as star 
schemata in this paper, which is only one (albeit common) pos- 
sible DW design, we plan to extend our approach by adding the 
merging of DWs modelled as constellation schemata in the future. 
There may also be so-called weak attributes in DW components. 
Thus, we will consider them in future work. Our goal is to pro- 
vide a complete approach that is integrated in our previous work 
concerning the automatic integration of tabular data in DWs. 
ABSTRACT


Following the current trend, most of the well-known database systems,
whether relational, NoSQL, or NewSQL, denote themselves as
multi-model. This industry-driven approach, however, lacks plenty
of important features of the traditional DBMSs. The primary problem
is the design of an optimal multi-model schema and its sufficiently
general and efficient representation. In this paper, we provide an
overview and discussion of the promising approaches that could
potentially be capable of solving these issues, along with a summary
of the remaining open problems.


CCS CONCEPTS 


• Information systems → Entity relationship models; Semi-structured
data; Data structures; Integrity checking; Query languages
for non-relational engines.


KEYWORDS


Multi-model data, Inter-model relationships, Conceptual modeling,


1 INTRODUCTION


In recent years, the Big Data movement has broken down borders 
of many technologies and approaches that have so far been widely 
acknowledged as mature and robust. One of the most challenging 
issues is the variety of data which may be present in multiple types 
and formats (structured, semi-structured, and unstructured), and 
so conform to various models. 


EXAMPLE 1. Let us consider an example of a multi-model scenario in Figure 1
(partially borrowed from [17]), backing a sample enterprise management
information system where individual parts of the data are intentionally stored
within different logical models. In particular, the social network of customers is
captured using a graph G (blue¹). Relational table T (violet) records additional
information about these customers, such as their credit limit. Orders submitted
by customers are stored as JSON documents in a collection D (green). A wide-column
table W (red) then maintains the history of all orders, and, finally,
a key/value mapping K (brown) stores currently existing shopping carts. A
cross-model query might return, e.g., friends of customers who ordered any
item with a price higher than 180. □


relational table T (recoverable column names: customer, address, credit)

property graph G

document collection D

  { order : 220,
    paid : true,
    items : [
      { product : T1, name : toy, price : 200, quantity : 2 },
      { product : B4, name : book, price : 150, quantity : 1 } ] }

wide-column table W

  customer | orders
  1        | [ 220, 230, 270, ... ]
  2        | [ 10, 217 ]
  3        | [ 370, 214, 94, 137 ]

key/value pairs K

  customer → cart
  1 → product: T1, name: toy, quantity: 2
      product: B4, name: book, quantity: 1
  2 → product: G1, name: glasses, quantity: 1
      product: B2, name: book, quantity: 1
  3 → product: B3, name: book, quantity: 2


Figure 1: Sample multi-model scenario


While the problem of multi-model data management may seem
similar to data integration and, hence, some approaches and ideas
can be re-used, the aim and motivation are different. In the multi-model
world, each of the involved single-model schemas represents
just a certain part of reality. The individual distinct models are
chosen to best conform to the structure of the respective part of reality
(and thus enable its optimal logical and physical representation).
These parts are mutually interlinked, and together they
form the whole reality. The multi-model schema can be viewed as a
possible result of the data integration process if the particular integrated
sources represent distinct data models and these features
need to be preserved.


¹ Individual colors are used to highlight the different models in the rest of the text.


Depending on the number of (different) underlying database
management systems (DBMSs) being involved, the so-called multi- 
model systems can be divided into single-DBMS and multi-DBMS 
(i.e., polystores). We target mainly the former group in this paper 
since such an approach is industry-driven rather than academic 
and thus more popular. Currently, there exist more than 20 repre- 
sentatives of single-DBMS multi-model databases, involving well- 
known tools from both the traditional relational and novel NoSQL 
or NewSQL systems, such as Oracle DB, IBM DB2, Cassandra, or 
MongoDB, to name just a few. 

According to a recent extensive survey [16], they have distinct 
features and can be classified according to various criteria. The 
core difference is the strategy used to extend the original model 
to other models or to combine multiple models: such new models 
can be supported via the adoption of an entirely new storage strat- 
egy, extension of the original storage strategy, creation of a new 
interface, or even no change in the original storage strategy (which 
is used for trivial cases). This bottom-up approach, driven by the 
needs of novel applications, lacks plenty of important features of 
the traditional database systems [10], though. 

The primary issue is designing an optimal multi-model schema 
and its sufficiently robust representation that can evolve efficiently 
and correctly with changes in user requirements. Although there 
exist mature and verified approaches commonly and successfully 
used for individual models, most of them cannot be easily extended 
for the multi-model data due to the contradictory features of the 
models (such as relational vs. hierarchical vs. graph data). 

In this paper, we provide an overview and discussion of existing 
approaches that are close to the world of multi-model data and 
could be exploited for this purpose. The main contributions of this 
paper are: 


• overview of the relevant related work for multi-model data
modeling and representation (Section 2),

• introduction to category theory and its relation to multi-model
data management (Section 3), and

• discussion of remaining research opportunities together with
possible inspirational solutions (Section 4).


Our objective is to introduce a promising research direction for the
data management community, both for researchers and practitioners.
Considering the number of available multi-model databases, the
indicated target has a high impact in both academia and
industry, but still requires extensive research and implementation
effort.


2 EXISTING APPROACHES 


In multi-model scenarios, we intentionally combine several struc- 
turally different logical data models. To simplify the description 
and understanding, we may consider only the currently most com- 
monly used ones: relational, aggregate-oriented (i.e., key/value, 
wide-column, document), and graph. Although definitions of these
data models involve at least the required data structures, other aspects
can be covered as well. E.g., following Codd [7], (at least) the
following three components can be assumed: (1) data structures,
(2) operations on data structures, and (3) integrity constraints for
operations and structures.

Pursuing the ideas of model-driven engineering [18], there are
different layers of information abstraction. The conceptual platform- 
independent model (PIM) describing the problem domain is trans- 
lated to one or more logical platform-specific models (PSMs), such 
as relational or semi-structured (e.g., XML), represented using a 
selected schema language (e.g., SQL DDL or XML Schema). 

In the following sections, we will have a look at a wide range 
of existing approaches and demonstrate how they can be utilized 
when the processing of multi-model data is to be tackled. In particu- 
lar, we will cover five basic aspects closely related to data modeling 
and representation: data design (Section 2.1), logical data represen- 
tation (Section 2.2), integrity constraints (Section 2.3), knowledge 
reasoning (Section 2.4), and evolution management (Section 2.5). 


2.1 Conceptual Layer 


The purpose of the platform-independent layer is to allow us to 
model schemas of databases at the conceptual layer, i.e., without 
limiting ourselves to features of specific logical models. This means 
we can think of the database contents purely in terms of sets of 
entities, relationships, and their characteristics, and so particular 
data structures that will, later on, be used for the actual data repre- 
sentation and physical storage (as well as the implied consequences 
and limitations) are purposely treated as unimportant for us at this 
moment. 


Figure 3: UML conceptual schema


There are basically two widely used traditional conceptual modeling
languages: ER [5] and UML [8] (namely class diagrams). The
former exists in several notations differing not just visually, but
also in particular constructs and their variations. Altogether, ER is
more expressive and allows us to better grasp the complex nature
and features of real-world entities and their relationships. UML,
on the other hand, is well standardized, but unfortunately
too data-oriented: its constructs and expressive
power are not that elaborate, and so UML schemas may conceal
certain details that could be significant for us (e.g., structures such
as weak entity types, structured attributes, etc.).

Given a conceptual schema, it can be transformed into a single
logical schema as a whole, or into
multiple distinct logical schemas at a time, each covering the entire
domain or just a part of it, and these may be mutually disjoint, connected, or
more or less overlapping. This principle, and hence the entire idea
of platform-independent modeling, becomes especially
important when working with data that really is, by nature,
multi-model.

Our ultimate objective is the ability to work with such multi-model
data in a unified way that simplifies the reality, yet still
captures thoroughly enough the specifics of at least the majority of
widely used logical models and data formats exploited in
existing database implementations. Although both ER and UML might
seem well suited for this purpose, there are various questions to be
answered and trade-offs to be established. For example, the
following aspects, among others, deserve attention: permitted ranges of
relationship type cardinalities or attribute multiplicities (particular values
or just N for the upper bounds), the semantics of these cardinalities
in n-ary associations, non-trivial depths of structured attributes,
the necessity of identifiers for entity types, the possibility of
identifiers even for relationship types, incorporation of just selected
participants in weak identifiers instead of the involved relationship
types as a whole, etc.


EXAMPLE 2. Continuing with the sample multi-model data presented in
Figure 1, its corresponding ER and UML schema diagrams are visualized in
Figures 2 and 3. □


2.2 Logical Layer 


The logical layer of data modeling aims to provide data structures 
in which the actual data is represented within a particular database 
system. For example, we can have key/value, document, or graph 
models. In fact, there is a wide span of such approaches, involving 
academic proposals (e.g., the X-SEM model [21] for XML data), 
official standards, as well as models provided by the existing DBMS 
representatives (e.g., the formal relational model vs. its derived 
versions). They can be abstract and unifying or not (generic trees 
vs. JSON), they can cover only one underlying model or more of 
them at a time (property graphs vs. associative arrays), they can 
consider only the structure of the data or extend it with its schema 
or integrity constraints, or they can even be represented by the PIM 
directly (such as the whiteboard-friendly [22] Neo4j graph model). 


NoSQL Abstract Model (NoAM). The first approach we start with, 
NoAM [1], specifies an intermediate, system-independent data rep- 
resentation for aggregate-oriented (i.e., key/value, wide-column, 
and document) NoSQL systems. The proposed methodology aims 
at designing a good (with regards to the performance, scalability, 
and consistency) representation of the data in NoSQL systems. 

A NoAM database is a set of collections having distinct names
(e.g., a wide-column table or a document collection). A collection
is a set of blocks (i.e., aggregates), each identified by a block key
unique within that collection (e.g., a row in a wide-column table or
a document in a collection) and being a maximal data unit for which
atomic, efficient, and scalable access operations are provided. A
block is a non-empty set of entries (e.g., table columns or document
fields). An entry is a pair $(e_k, e_v)$, where $e_k$ is the entry key (unique
within its block) and $e_v$ is the (complex or scalar) entry value.


EXAMPLE 3. Consider the document model of our sample data in Example 1.
NoAM first defines two basic representation strategies: Entry per
Aggregate Object (EAO) and Entry per Top-level Field (ETF). In EAO, document
collection D would be represented as depicted in Figure 4 (a). Each block
corresponds to one order, and the block key corresponds to the order identifier (220).
There is only one entry in each block, so the entry key is empty (ε), and the
value contains the rest of the data. For ETF, the block would have two entries
with keys corresponding to top-level fields, as shown in Figure 4 (b). □


Figure 4: Strategies of NoAM: (a) EAO, (b) ETF, (c) access paths
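The two strategies can be illustrated with a small Python sketch over the sample order document. The function names and the use of plain dictionaries for collections, blocks, and entries are assumptions made for this illustration and are not part of NoAM itself.

```python
# Sample order document from collection D (values taken from Figure 1).
order = {
    "order": 220,
    "paid": True,
    "items": [
        {"product": "T1", "name": "toy", "price": 200, "quantity": 2},
        {"product": "B4", "name": "book", "price": 150, "quantity": 1},
    ],
}


def to_eao(doc, key_field):
    """Entry per Aggregate Object: one block with a single entry and an empty key."""
    value = {k: v for k, v in doc.items() if k != key_field}
    return {doc[key_field]: {"": value}}


def to_etf(doc, key_field):
    """Entry per Top-level Field: one entry per top-level field of the aggregate."""
    return {doc[key_field]: {k: v for k, v in doc.items() if k != key_field}}


eao_collection = to_eao(order, "order")   # {220: {"": {"paid": ..., "items": ...}}}
etf_collection = to_etf(order, "order")   # {220: {"paid": ..., "items": ...}}
```

In both cases the block key is the order identifier; the strategies differ only in how the remaining data is split into entries.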


Next, data access patterns are considered using the notion of an
access path, i.e., a location in the structure of a complex value. If $v$
is a complex value and $v'$ is a (possibly complex) value occurring
in $v$, then the access path $ap_{v'}$ for $v'$ in $v$ represents the sequence
of steps that need to be taken to reach the component value $v'$
in $v$. A complex value $v$ can then be represented using a set of
entries, whose keys are access paths in $v$. Each entry is expected to
represent a distinct portion of the complex value $v$, characterized
by a location in its structure.


EXAMPLE 4. Based on the access patterns, collection D could be partitioned,
e.g., as depicted in Figure 4 (c), where access paths to particular fields
in the array are assumed. □
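A partitioning by access paths can be sketched as follows. The function `partition_by_access_paths`, the 0-based indexing of array elements, and the textual form of the paths (e.g., `items[1]`) are assumptions made for this illustration; a real design would choose the paths from the observed access patterns.

```python
def partition_by_access_paths(block):
    """Split a block into entries keyed by access paths.

    Scalar top-level fields keep their name as the path; each element of an
    array-valued field becomes its own entry, addressed as field[index].
    """
    entries = {}
    for field_name, value in block.items():
        if isinstance(value, list):
            for index, element in enumerate(value):
                entries[f"{field_name}[{index}]"] = element
        else:
            entries[field_name] = value
    return entries


# Block of order 220 without its block key.
block_220 = {
    "paid": True,
    "items": [
        {"product": "T1", "name": "toy", "price": 200, "quantity": 2},
        {"product": "B4", "name": "book", "price": 150, "quantity": 1},
    ],
}

entries_220 = partition_by_access_paths(block_220)
```

Each ordered item can now be read or updated through its own entry without touching the rest of the aggregate.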


The proposed choice of aggregates and their partitioning is
driven by the data access patterns of the application operations, as
well as by scalability and consistency needs. In addition, to support
the strong consistency of update operations, each aggregate should
include all the data involved in some integrity constraints. On the
other hand, aggregates should be as small as possible. A
particular set of rules for partitioning the data model into
aggregates that reflects all these requirements is proposed
in the paper.

Unfortunately, although this approach considers a set
of data models (albeit only closely related, aggregate-oriented
ones), they are considered separately rather than in combination. The
proposed strategies for a good aggregate design indicate how this
aim differs from the traditional SQL and NoSQL world and what
should be taken into account to cover them both.


Associative Arrays. The idea of associative arrays [11] enables us 
to logically (and possibly also physically) represent all data models 
using a single generic data structure providing an abstraction for 
many classes of databases (SQL, NoSQL, NewSQL, or graph). 

An associative array $A$ (naturally corresponding to a table or
matrix) is defined as a mapping from pairs of keys to values
$A : K_1 \times K_2 \to V$, where $K_1$ are row keys and $K_2$ are column keys, both of
which can be any sortable sets (integers, strings, etc.). The following
rules must hold: (1) each row key and each column key is unique,
and (2) there are no rows or columns that would be entirely empty.
Basic operations for associative arrays are element-wise addition,
element-wise multiplication, and array multiplication, which correspond,
e.g., to the database operations of table union, intersection,
and transformation.
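The three basic operations can be sketched over a dictionary-based representation. This is a minimal illustration with numeric values only; the function names, the choice of a Python dict keyed by (row, column) pairs, and the sample data are assumptions and do not reproduce any particular associative-array library.

```python
from collections import defaultdict


def _clean(entries):
    """Drop zero entries so that no row or column is left entirely empty."""
    return {key: value for key, value in entries.items() if value != 0}


def add(a, b):
    """Element-wise addition: keys of both arrays are kept (table union)."""
    out = defaultdict(int)
    for array in (a, b):
        for key, value in array.items():
            out[key] += value
    return _clean(out)


def multiply(a, b):
    """Element-wise multiplication: only shared keys survive (intersection)."""
    return _clean({key: a[key] * b[key] for key in a.keys() & b.keys()})


def matmul(a, b):
    """Array multiplication: column keys of a are matched with row keys of b."""
    out = defaultdict(int)
    for (row, inner_a), value_a in a.items():
        for (inner_b, col), value_b in b.items():
            if inner_a == inner_b:
                out[(row, col)] += value_a * value_b
    return _clean(out)


# Hypothetical arrays: ordered quantities and unit prices.
quantities = {("alice", "toy"): 2, ("alice", "book"): 1}
other = {("alice", "toy"): 1, ("bob", "book"): 4}
prices = {("toy", "price"): 200, ("book", "price"): 150}

summed = add(quantities, other)
shared = multiply(quantities, other)
totals = matmul(quantities, prices)   # customer x product times product x price
```

Here `totals` transforms the customer-by-product array into a customer-by-price array, which is the sense in which array multiplication acts as a transformation.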


Figure 5: Associative arrays for various models


EXAMPLE 5. As shown in Figure 5, for relational T, key/value K, and
wide-column W models, the representation in associative arrays is straightforward.
For graph model G, we can use any matrix-based graph representation.
Hierarchical data D can be represented as sparse arrays by traversing the
hierarchy and incrementing/appending the row counters and column keys to
emit row-column-value triples. □
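One possible traversal of this kind is sketched below. It is only an illustration under assumptions: the function `to_triples`, the rule that every array element opens a new row, and the "/"-separated column keys are our choices and need not match the scheme of the cited work.

```python
def to_triples(doc):
    """Flatten a nested document into (row, column, value) triples."""
    triples = []
    row_counter = 0

    def walk(value, column, row):
        nonlocal row_counter
        if isinstance(value, dict):
            for name, child in value.items():
                # Nested field names are appended to the column key.
                walk(child, f"{column}/{name}" if column else name, row)
        elif isinstance(value, list):
            for element in value:
                row_counter += 1          # each array element opens a new row
                walk(element, column, row_counter)
        else:
            triples.append((row, column, value))

    walk(doc, "", row_counter)
    return triples


order = {
    "order": 220,
    "paid": True,
    "items": [
        {"product": "T1", "name": "toy", "price": 200, "quantity": 2},
        {"product": "B4", "name": "book", "price": 150, "quantity": 1},
    ],
}

order_triples = to_triples(order)
```

The top-level fields end up in row 0, while the two ordered items occupy rows 1 and 2 under column keys such as items/product and items/name.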


Tensor Data Model (TDM). The TDM [14] permits us to view
multi-model data using the notion of a tensor. A tensor is often described
as a generalized matrix: a 0-order tensor is a scalar, a 1-order tensor
is a vector, a 2-order tensor is a traditional matrix, and tensors of order 3 or
higher are called higher-order tensors. In TDM, tensor dimensions
and values are defined as a map that associates keys to values as
$T : K_1 \times \ldots \times K_n \to V$, where $K_i$ for $i \in \{1, \ldots, n\}$ are sets of keys, $n \in \mathbb{N}$
is a dimension, and $V$ is the set of values. In addition, the tensors are
named and typed. Tensor operations, analogously to operations on
matrices and vectors, are multiplications, transposition, unfolding
(transforming a tensor to a matrix), factorizations (decompositions),
and others.


EXAMPLE 6. Relational T, key/value K, wide-column W, and graph
G models are mapped to tensors straightforwardly, similarly to Figure 5.
Multigraphs can be modeled by a 3-order tensor where one dimension is used
to specify the different types of edges. Document model D is not considered in
TDM. □
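The multigraph case can be illustrated with a sparse, dictionary-based tensor keyed by (source, target, edge type). The edge data, the function names, and the dict representation are assumptions for this sketch; it only shows the key-to-value mapping and a simple unfolding, not an implementation of TDM.

```python
# Hypothetical multigraph: a 3-order tensor K1 x K2 x K3 -> V,
# with keys (source, target, edge_type) and value 1 for an existing edge.
edges = {
    ("1", "2", "friend"): 1,
    ("2", "3", "friend"): 1,
    ("1", "3", "colleague"): 1,
}


def tensor_slice(tensor, axis, key):
    """Fix one dimension to a key and return the remaining lower-order tensor."""
    return {k[:axis] + k[axis + 1:]: v for k, v in tensor.items() if k[axis] == key}


def unfold(tensor, row_axis):
    """Unfolding: keep one dimension as rows and fuse the others into column keys."""
    matrix = {}
    for k, v in tensor.items():
        matrix[(k[row_axis], k[:row_axis] + k[row_axis + 1:])] = v
    return matrix


friend_graph = tensor_slice(edges, 2, "friend")   # ordinary adjacency matrix
by_source = unfold(edges, 0)                      # rows are source vertices
```

Slicing along the edge-type dimension recovers one ordinary graph per edge type, which is why a single 3-order tensor suffices for the multigraph.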
2.3 Integrity Constraints


Integrity constraints, representing various conditions the data must
abide by, form another component of data modeling [7]. The Object
Constraint Language (OCL) [24], a part of UML, is one of the
common approaches: a declarative and strongly typed language
for expressing complex integrity constraints, such as those
that would otherwise be difficult or even impossible to express
using cardinalities or other constructs within the UML conceptual
schemas themselves.

A thorough enumeration of all kinds of constructs provided by
OCL would be beyond the scope of this paper. Therefore, let us only
briefly mention the constructs used most
frequently. Although OCL also supports expressions that describe, for
example, pre-conditions and post-conditions for
methods and operations, as well as rules for initial or derived values
of attributes, invariants are of the highest importance here.
Their purpose is to define assertions that instances of the static
data must satisfy all the time. Invariants are written in the form
context type inv name : expression (slightly simplified for our purposes),
where name is an optional constraint name and type is
the name of a specific type (UML class) whose instances should
satisfy the condition expression. This condition can be simple or
complex, involving Boolean expressions with various logical connectives,
navigation operators (so that we can traverse UML
associations and reach their counterparties), let expressions allowing
auxiliary variable assignments, calls of
functions such as size() on collections, and, in particular,
the exists() and forAll() functions, which simulate the existential
and universal quantifiers.


EXAMPLE 7. The following invariant ensures that each order of any customer
(in the sample data) must have at least one ordered item:
context Customer inv :
self.Orders->forAll( o | o.Items->size() >= 1 ) □
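To make the semantics of this invariant concrete, the following sketch evaluates the same condition over hypothetical in-memory data. It is plain Python, not OCL tooling; the data layout and the function `invariant_holds` are assumptions introduced for illustration.

```python
# Hypothetical customers with their orders and ordered items.
customers = [
    {"id": 1, "orders": [{"order": 220, "items": ["T1", "B4"]}]},
    {"id": 2, "orders": []},                               # no orders: holds vacuously
    {"id": 3, "orders": [{"order": 370, "items": []}]},    # an order without items
]


def invariant_holds(customer):
    """Counterpart of self.Orders->forAll( o | o.Items->size() >= 1 )."""
    return all(len(order["items"]) >= 1 for order in customer["orders"])


violations = [c["id"] for c in customers if not invariant_holds(c)]
```

As with forAll() in OCL, a customer without any order satisfies the invariant, so only customer 3 is reported.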


The authors of OCL aimed to propose a language
that would be formal enough (so that different interpretations would
be avoided), yet with syntax still user-friendly enough
(so that it could be used even without more profound mathematical
skills). Although there are already approaches enabling the transformation
of OCL conditions into SQL expressions for the relational
model, and OCL itself respects the nature of multi-model data
(since it resides at the platform-independent layer), its broader applicability
in this context is not straightforward without appropriate
support and tools. Furthermore, it is necessary to propose means
by which integrity constraints across models are to be represented, as
well as implemented and validated.


2.4 Knowledge Reasoning 


Description logics [3] form a family of formalisms for knowledge
representation, one of the fields of artificial intelligence. They are also
tightly related to, and widely applicable in, the processing
and modeling of RDF data, possibly enriched with RDFS schemas
or OWL ontologies. The purpose of a description language is to
capture information about a part of the world in which we are
interested. Working with the open-world assumption, the emphasis
is put on the reasoning functionality via which one is able to infer


Multi-Model Data Modeling and Representation: State of the Art and Research Challenges 


new facts that are not provided explicitly in the database, i.e., the 
knowledge base. 

Basic building blocks are formed by concepts and roles, denoting
sets of individuals and their binary relationships, respectively.
Given elementary descriptions in the form of atomic concepts and
roles, as well as the universal concept $\top$ (all individuals) and the
bottom concept $\bot$ (no individual), more complex descriptions
of concepts can be obtained inductively by using various kinds of
constructors: $\neg A$ (atomic or general negation), $A \sqcap B$ (intersection),
$A \sqcup B$ (union), $\forall R.B$ (value restriction), $\exists R.\top$ and $\exists R.B$ (limited and
full existential quantification), and, finally, $\geq n R$ and $\leq n R$ (at-least
and at-most number restrictions), where $A$ and $B$ are concepts, $R$ is a
role, and $n \in \mathbb{N}$.

Depending on the supported subset of the introduced construc- 
tors, individual representatives of particular description languages 
are distinguished. Their expressive power varies greatly, the $\mathcal{AL}$
(attributive language) being the minimal one of practical interest. 
Finding reasonable trade-offs is necessary in order to deal with 
tractability aspects, i.e., to ensure that reasoning and other related 
problems are decidable in polynomial time. 

A knowledge base system consists of two components: termi- 
nology (TBox) and assertions (ABox). While the purpose of the first 
component is to provide a set of terminological axioms in the form
of inclusions ($A \sqsubseteq B$) and equalities ($A \equiv B$, usually in the form of
definitions, where there are only symbolic names on their left-hand 
sides, and each one of them is defined at most once), the latter one 
describes the extensional data, i.e., assertions about named individ- 
uals. Formal semantics of these assertions can be well defined using 
a fragment of the first-order logic, in particular by unary predicates 
for atomic concepts, binary predicates for roles, and more complex 
formulae in the case of the derived concepts. Working with the 
description language expressions is, however, more convenient, not 
just because there is no need to use variables that would otherwise 
be needed in the translated formulae. 

More complicated expressions for derived concepts can actually
be viewed as a kind of querying, yet at a layer of lower granularity
focusing on individuals (entities) rather than their inner
characteristics (attributes, properties, etc.), and thus not fully
meeting the needs of multi-model data processing.


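EXAMPLE 8. Assuming that Orders² is a role describing a mapping
from customers to orders,
$Q \equiv \mathit{Customer} \sqcap \forall \mathit{Friend}.(\geq 3\, \mathit{Orders})$
then describes customers such that all their friends have at least
3 orders, i.e.,
$Q^{\mathcal{I}} = \mathit{Customer}^{\mathcal{I}} \cap \{c \mid c \in \Delta^{\mathcal{I}} \wedge \forall f : (c, f) \in \mathit{Friend}^{\mathcal{I}} \Rightarrow f \in \{w \mid w \in \Delta^{\mathcal{I}} \wedge |\{o \mid (w, o) \in \mathit{Orders}^{\mathcal{I}}\}| \geq 3\}\}$,
where $\mathcal{I}$ is the interpretation structure with the domain
of individuals $\Delta^{\mathcal{I}}$. □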
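The set-based semantics of the concept in Example 8 can be evaluated directly over a small finite interpretation. The following is a minimal sketch only: the individuals, the friend pairs, and the order pairs are invented for illustration, and the helper names `at_least` and `value_restriction` are hypothetical rather than part of any description logic reasoner.

```python
# Hypothetical finite interpretation (all individuals and pairs are invented).
domain = {"mary", "anne", "john", "o1", "o2", "o3", "o4"}
customer = {"mary", "anne", "john"}                  # interpretation of Customer
friend = {("mary", "anne"), ("anne", "john")}        # interpretation of Friend
orders = {("anne", "o1"), ("anne", "o2"), ("anne", "o3"), ("john", "o4")}


def at_least(n, role, domain):
    """Individuals with at least n distinct role successors (>= n R)."""
    return {w for w in domain
            if len({o for (s, o) in role if s == w}) >= n}


def value_restriction(role, concept, domain):
    """Individuals all of whose role successors belong to concept (forall R.C)."""
    return {c for c in domain
            if all(f in concept for (s, f) in role if s == c)}


# Q = Customer AND forall Friend.(>= 3 Orders)
q = customer & value_restriction(friend, at_least(3, orders, domain), domain)
print(sorted(q))
```

Note that a customer without any friends satisfies the value restriction vacuously, which is why `john` is part of the result in this sample data.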


Having complex concepts derived, we can perform the instance 
classification tests (whether a particular individual belongs to a 
given concept) or retrievals (acquiring a set of all individuals be- 
longing to a given concept). Besides the granularity, another limita- 
tion is brought by the idea that only binary roles are assumed, and 
derivation of complex roles is usually not considered. 


²We are aware of a widely adopted convention where concepts are usually named by
nouns, while names of roles are derived from verbs (e.g., makesOrder instead of our 
Orders). However, we intentionally decided not to follow these principles and instead 
we use names that directly correspond to the names of individual entities as they are 
introduced in our multi-model scenario presented in Figure 1. 
2.5 Evolution Management


No matter how well a schema of data is designed, sooner or later 
user requirements may change and such changes then need to be 
appropriately reflected in both the data structures and all related 
parts of the system, too. 


DaemonX. In paper [27], the authors propose a five-level evolution
management framework called DaemonX. For the forward-engineering
design of an application, they consider a top-down
approach starting from the design of a PIM and then mapping its 
selected parts to the respective single-model PSMs (followed by 
schema, operational, and extensional levels). The supported PSM 
models involve XML, relational, business-process, and REST. The 
authors focus on correct and complete propagation of changes be- 
tween the models, especially when the PSM schemas overlap, and 
the change must be propagated correctly to all the instances. In 
addition, mapping to queries and respective propagation of changes 
to the operational level is considered, too. 


EXAMPLE 9. DaemonX uses the classical UML class diagram (from
Figure 3) for the PIM level. Parts of the PIM (possibly overlapping)
are then chosen and mapped to particular PSM diagrams. In Figure 6,
we depict the situation for the relational $\mathcal{T}$ and document
$\mathcal{D}$ models. For the relational PSM (violet), the ER model
(from Figure 2) is used. For the document PSM (green), the XSEM model
is used. It enables us to model the hierarchy, repetition, and other
features of semi-structured data. At the schema level (not depicted
in the figure), there would be the database schema of relational
tables and JSON documents expressed in respective languages. □


The idea of the framework is general and extensible, so any model 
which can be mapped to the common general PIM schema can be 
added to the framework. However, the authors do not consider 
inter-model links between PSM schemas, cross-model queries, nor 
the storage of multi-model data in a single DBMS. 


MigCast. Another approach, MigCast [9], also focuses on the 
evolution management in the multi-model world, namely aggregate- 
oriented NoSQL systems. It utilizes a cost model based on the char- 
acteristics of the data, expected workload, data model changes, and 
cloud provider pricing policies, too. While the data migration can be 
eager, lazy, proactive, etc., each having its (dis)advantages, MigCast 
provides an estimation of their costs and helps the users to find 
optimal solutions for a given application. 


3 BROADER GENERALIZATION 


All the approaches in the previous section can be described as 
promising generalizations of particular aspects of the data manage- 
ment towards multiple models. However, the idea of generalization 
can go even further and cover more data processing aspects at a 
time using the same formal framework. 

Category theory [2] is a branch of mathematics that attempts 
to formalize various widely used and studied structures in terms 
of categories, i.e., in terms of directed labeled graphs composed 
from objects representing graph vertices, and morphisms (or equiva- 
lently also arrows), i.e., mappings between the objects, representing 
directed graph edges. 

Figure 6: PIM and PSM levels in DaemonX


Formally, a category $\mathbf{C}$ consists of a set of objects
$obj(\mathbf{C})$ and a set of morphisms $mor(\mathbf{C})$, each of
which is modeled and depicted as an arrow $f : A \to B$, where
$A, B \in obj(\mathbf{C})$, $A$ being treated as a domain and $B$ as
a codomain, respectively. Whenever $f, g \in mor(\mathbf{C})$ are two
morphisms $f : A \to B$ and $g : B \to C$, it must hold that
$g \circ f \in mor(\mathbf{C})$, i.e., morphisms can be composed using
the $\circ$ operation and the composite $g \circ f$ must also be a
morphism of the category (transitivity of morphisms is required).
Moreover, $\circ$ must satisfy the associativity, i.e.,
$h \circ (g \circ f) = (h \circ g) \circ f$ for any triple of
morphisms $f, g, h \in mor(\mathbf{C})$, $f : A \to B$,
$g : B \to C$, and $h : C \to D$. Finally, for every object $A$,
there must exist an identity morphism $1_A$ such that
$f \circ 1_A = f = 1_B \circ f$ for any $f : A \to B$ (obviously
serving as a unit element with respect to the composition).

Although objects in real-world categories usually tend to be 
sets of certain items and morphisms functions between them, both 
objects and morphisms may actually represent abstract entities of 
any kind. 


EXAMPLE 10. $\mathbf{Set}$ (as widely denoted) is a category where
objects are arbitrary sets (not necessarily finite), and morphisms are
functions between them (not necessarily injective nor surjective),
together with the traditionally understood composition of functions
and identities.

Having a graph $G = (V, E)$, where $V$ is a set of vertices and
$E \subseteq V \times V$ is a set of directed edges, we could derive
another category where objects are the original vertices and morphisms
simply the edges, composition $\circ$ producing ordinary edges forming
a kind of shortcut for the collapsed paths, and identities working as
loops. Such a structure may, but need not, define a well-formed
category, since it may happen that for some pair of edges (morphisms)
$f = (a, b)$ and $g = (b, c) \in E$ the composite
$g \circ f = (a, c) \notin E$, i.e., the composed edge may not be in
the graph. □
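The closure condition from the second part of Example 10 is easy to test mechanically. The sketch below assumes that edges are plain ordered pairs (so there is at most one morphism between any two vertices) and that the identity loops are added implicitly; the function name `missing_composites` and the sample graph are illustrative assumptions.

```python
def missing_composites(vertices, edges):
    """Return the composed edges (a, c) that are required but absent.

    Morphisms are the given edges plus one identity loop per vertex.
    The graph yields a well-formed category only if the result is empty.
    """
    morphisms = set(edges) | {(v, v) for v in vertices}
    return {(a, c)
            for (a, b) in morphisms
            for (b2, c) in morphisms
            if b == b2 and (a, c) not in morphisms}


vertices = {"a", "b", "c"}
path_only = {("a", "b"), ("b", "c")}             # composite (a, c) is missing
with_shortcut = path_only | {("a", "c")}         # closed under composition

print(missing_composites(vertices, path_only))       # {('a', 'c')}
print(missing_composites(vertices, with_shortcut))   # set()
```

Because each ordered pair of vertices carries at most one morphism in this representation, associativity and the identity laws hold automatically once the closure check passes.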


Category theory can also be applied to data processing (e.g., 
modeling, representation, transformation, querying, etc.). Actually, 
such proposals already exist, though not robust enough and focus- 
ing mostly just on the relational model alone (as a single-model 
solution only). 

In the following text, we first describe an existing approach 
using which schemas of relational databases can be modeled (Sec- 
tion 3.1). Based on it, we then propose its possible extension to- 
wards the multi-model scenario to illustrate the challenges involved 
(Section 3.2). Returning to the existing approaches, we then
discuss how instances of relational databases (the actual data) can 
be modeled (Section 3.3), and, last but not least, and at a higher 
level of abstraction, we describe how categories of all the possible 
schemas or instances (Section 3.4) can be utilized for the purpose 
of transformations or querying (Section 3.5). 
3.1 Relational Schema Category


The approach by Spivak [30] allows for a representation of schemas
of relational databases in terms of categories. Suppose we have a
schema of a relational database, i.e., a set of named tables, each
with a compulsory single-column primary key³ and other columns
(if any), possibly interlinked with foreign keys. The corresponding
category $\mathbf{T}$ (let us call it a schema category) will have
objects of two sorts: (1) objects for tables and (2) objects for
generalized data types (e.g., number, string, ...). As for the
morphisms: (1) for every table $t$ and its column $c$ (other than the
primary key and foreign key columns) of data type $d$, there will be
a morphism $c : t \to d$, (2) for every table $t$, there will be an
implicit identity morphism $id_t : t \to t$ allowing us to represent
the primary key of $t$, (3) for every data type $d$ an implicit
identity morphism $id_d : d \to d$ (they will not become useful later
on, but they must be there to make the category well-formed), and
(4) for every foreign key from a column $c$ of table $t_1$ referencing
the primary key of $t_2$, there will be a morphism $c : t_1 \to t_2$.

To make our description complete, let us mention that the
approach also covers integrity constraints [31], though we will
not discuss them further.


EXAMPLE 11. The abstract schema representation of the relational
table $\mathcal{T}$ from Example 1 is depicted in Figure 7. Object
customer represents the table itself (depicted as a full circle),
objects string and number are generalized primitive types (as empty
circles), and $name : customer \to string$,
$address : customer \to string$, and $credit : customer \to number$
are morphisms for the individual columns.

To simplify the figure, we omitted the visualization of all the
involved identity morphisms: the one for the primary key of table
customer, as well as both the identity morphisms on the involved data
type objects. □


Figure 7: Schema for table $\mathcal{T}$
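The four construction rules for a schema category can be written down as a short procedure. This is only a sketch under simplifying assumptions: it lists the generating morphisms (columns, identities, foreign keys) and does not materialize composites, and the `Morphism` class, the `schema_category` function, and the input layout are hypothetical names chosen for illustration, not Spivak's notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morphism:
    name: str
    source: str
    target: str


def schema_category(tables, foreign_keys=()):
    """Build objects and generating morphisms of a schema category.

    tables:       {table: {column: data_type}} for ordinary columns only
                  (primary key and foreign key columns are excluded)
    foreign_keys: iterable of (column, table, referenced_table)
    """
    objects = set(tables)
    morphisms = set()
    for table, columns in tables.items():
        # rule (2): identity morphism standing for the primary key
        morphisms.add(Morphism(f"id_{table}", table, table))
        for column, data_type in columns.items():
            objects.add(data_type)
            # rule (1): column morphism from the table to its data type
            morphisms.add(Morphism(column, table, data_type))
    for data_type in objects - set(tables):
        # rule (3): identity morphism on every data type object
        morphisms.add(Morphism(f"id_{data_type}", data_type, data_type))
    for column, table, referenced in foreign_keys:
        # rule (4): foreign key morphism between table objects
        morphisms.add(Morphism(column, table, referenced))
    return objects, morphisms


# The customer table from Example 11.
objects, morphisms = schema_category(
    {"customer": {"name": "string", "address": "string", "credit": "number"}}
)
print(sorted(objects))
print(sorted(m.name for m in morphisms))
```

For the customer table this yields three objects and six morphisms: the three column morphisms shown in Figure 7 plus the three identities that the figure omits.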


³Unfortunately, other situations such as primary keys consisting of multiple columns
or unique keys are not considered by the author. In case there is no primary key, an 
implicit one needs to be assumed. 
3.2 Multi-Model Schema Category


Even though the presented approach considers the relational model
alone (and, unfortunately, with various limitations), we could try
to go beyond the originally proposed principles and context. In
particular, and as presented in the following example, we could
draft an abstract schema that attempts to cover the entire
multi-model scenario.


EXAMPLE 12. Building on top of the data structures as they are
defined at the logical layer in our sample multi-model scenario, the
resulting schema draft could be constructed as it is depicted in
Figure 8. Adopting the objects and morphisms for the relational table
$\mathcal{T}$ from the previous example, we just need to add new ones
for the remaining parts of the scenario. Namely, friend morphism from
graph $\mathcal{G}$, objects order and item and the related morphisms
from documents $\mathcal{D}$, morphism orders from column family
$\mathcal{W}$, and, finally, morphism cart derived from key/values in
$\mathcal{K}$, all along with appropriate data type objects binary
and boolean. □


Figure 8: Schema category for the multi-model scenario
(labels recoverable from the figure: objects customer, order, and
item; data type objects number, binary, string, and boolean;
morphisms cart, orders, name, address, friend, credit, items,
product, price, quantity, and paid)


It may seem (at least at first glance) that categories such as
these really could cover multi-model schemas. However, the draft we
have just outlined directly raises a number of issues, questions,
and challenges, including but not limited to the following: how the
contents of key/value pairs should be represented (whether as binary
black boxes or unfolded structures), how collections and their
members should be modeled (e.g., items of JSON arrays), how embedded
structures should be treated (e.g., JSON subdocuments), what
directions of morphisms should be selected, how duplication of
morphisms and consistency of shared morphisms should be handled
(e.g., names of customers in both $\mathcal{T}$ and $\mathcal{G}$),
how compound primary keys (or other identifiers) should be
represented, how the reverse decomposition into optimal logical
schemas should be performed (similarly as in NoAM), or how the
knowledge available at the conceptual layer could, in general, be
exploited.


3.3 Relational Instance Category 


Let us now return to the existing approaches and continue
with categories using which we are able to represent particular
data, i.e., instances of relational databases. In order to follow
this idea, Spivak introduces a category $\mathbf{Set}_{\mathbf{T}}$
(let us call it an instance category). Its purpose is to represent
one particular data instance of a relational database conforming to
a schema $\mathbf{T}$. Apparently, for every possible database
content, i.e., for every possible instance, a different category
$\mathbf{Set}_{\mathbf{T}}$ will be constructed.

Similarly to a schema category $\mathbf{T}$ itself, the newly
introduced instance category $\mathbf{Set}_{\mathbf{T}}$ will have an
object for every table and every data type. The former ones are
internally modeled as sets of all the actively occurring values of
the corresponding primary keys, while the latter ones are internally
modeled as sets of all the possible values (domains) of the
corresponding primitive data types. As for the morphisms, they are
also analogous to the morphisms in the original $\mathbf{T}$. In
particular, the purpose of the column morphisms is to allow us to
assign particular column values to the individual values of the
primary key, and, as a consequence, permit us to reconstruct the
individual tuples (rows) of a corresponding table.


EXAMPLE 13. Having just a single relational table $\mathcal{T}$ from
Example 1 with its schema already modeled using a schema category
$\mathbf{T}$, the instance category $\mathbf{Set}_{\mathbf{T}}$
describing the data in our database has the following components:
table object customer internally representing a set of values
$\{1, 2, 3\}$, data type objects string and number representing the
corresponding domains, column morphism $name : customer \to string$
with mappings {(1, "Mary"), (2, "Anne"), (3, "John")}, morphism
$address : customer \to string$ defined as
{(1, "..."), (2, "..."), (3, "...")}, and, finally, morphism
$credit : customer \to number$ materialized as
{(1, 3000), (2, 2000), (3, 5000)}. □
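The instance category of Example 13 can be mirrored with ordinary sets and dictionaries, which also shows how the column morphisms allow the rows to be reconstructed. In this sketch, the `rows` helper and the `id` field name are assumptions made for illustration, and the address values stay elided exactly as in the example.

```python
# Table object: the set of actively occurring primary key values.
customer = {1, 2, 3}

# Column morphisms as mappings from primary key values to column values.
name = {1: "Mary", 2: "Anne", 3: "John"}
address = {1: "...", 2: "...", 3: "..."}   # elided in the source example
credit = {1: 3000, 2: 2000, 3: 5000}


def rows(keys, columns):
    """Reconstruct tuples by applying every column morphism to each key."""
    return [
        {"id": key, **{col: mapping[key] for col, mapping in columns.items()}}
        for key in sorted(keys)
    ]


table = rows(customer, {"name": name, "address": address, "credit": credit})
for row in table:
    print(row)
```

Each column morphism must be total on the table object, which corresponds here to every dictionary having an entry for every key in `customer`.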


3.4 Higher-Level Categories 


One of the interesting features of category theory is that it easily
permits us to work at different levels of abstraction. Until now,
we provided several sample categories, including schema $\mathbf{T}$
and instance $\mathbf{Set}_{\mathbf{T}}$, the former allowing for a
description of a particular database schema, the latter for a
description of a particular database instance. In order to be able to
model more complicated database-related processes and concepts, we
need to be capable of constructing categories over different
categories. For this purpose, we need to introduce the notion of
functors.

Assuming that $\mathbf{C}$ and $\mathbf{D}$ are categories, a functor
$F : \mathbf{C} \to \mathbf{D}$ is a mapping of objects and morphisms
such that the following conditions are satisfied: (1) each object
$A \in obj(\mathbf{C})$ is transformed to a corresponding object
$F(A) \in obj(\mathbf{D})$ (this mapping need not be bijective);
(2) each morphism $f \in mor(\mathbf{C})$, $f : A \to B$ is mapped
to a morphism $F(f) : F(A) \to F(B)$, specifically ensuring that
$F(1_A) = 1_{F(A)}$ for any $A \in obj(\mathbf{C})$, and also
guaranteeing that $F(g \circ f) = F(g) \circ F(f)$ for any suitable
$f, g \in mor(\mathbf{C})$.
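For finite categories, the functor conditions can be checked exhaustively. The following sketch assumes a hypothetical dictionary encoding of a category (morphisms with their domains and codomains, identities, and an explicit composition table); neither the encoding nor the function name `is_functor` comes from the cited works.

```python
def is_functor(C, D, F_obj, F_mor):
    """Check the functor conditions for finite categories given as dicts.

    A category is a dict with the keys
      "mor":  {morphism name: (domain, codomain)}
      "id":   {object: name of its identity morphism}
      "comp": {(g, f): name of the composite g after f}
    """
    # F(f) must lead from F(A) to F(B) for every f : A -> B.
    for f, (a, b) in C["mor"].items():
        if D["mor"].get(F_mor[f]) != (F_obj[a], F_obj[b]):
            return False
    # Identities must be preserved: F(1_A) = 1_F(A).
    for a, id_a in C["id"].items():
        if F_mor[id_a] != D["id"][F_obj[a]]:
            return False
    # Composition must be preserved: F(g . f) = F(g) . F(f).
    for (g, f), gf in C["comp"].items():
        if D["comp"].get((F_mor[g], F_mor[f])) != F_mor[gf]:
            return False
    return True


# A category with two objects and a single non-identity arrow f : A -> B.
arrow = {
    "mor": {"1_A": ("A", "A"), "1_B": ("B", "B"), "f": ("A", "B")},
    "id": {"A": "1_A", "B": "1_B"},
    "comp": {("1_A", "1_A"): "1_A", ("1_B", "1_B"): "1_B",
             ("f", "1_A"): "f", ("1_B", "f"): "f"},
}
# A category with one object and only its identity.
point = {
    "mor": {"1_X": ("X", "X")},
    "id": {"X": "1_X"},
    "comp": {("1_X", "1_X"): "1_X"},
}

# Collapsing everything onto the single object is a (non-injective) functor.
collapse_ok = is_functor(arrow, point, {"A": "X", "B": "X"},
                         {"1_A": "1_X", "1_B": "1_X", "f": "1_X"})
# Swapping the objects while keeping f is not: F(f) would have to go B -> A.
swap_ok = is_functor(arrow, arrow, {"A": "B", "B": "A"},
                     {"1_A": "1_B", "1_B": "1_A", "f": "f"})
print(collapse_ok, swap_ok)
```

The first mapping illustrates that a functor need not be bijective on objects, in line with condition (1).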

Following Spivak, we are now ready to introduce two more
categories, namely T-Schema and T-Inst. While the
purpose of the first one is to describe all the possible schemas of 
relational databases, the purpose of the second one is to describe 
all the possible instances of relational databases (regardless of their 
schemas). Let us have a look at the details at least briefly. 

Category T-Schema contains one object for every potentially 
existing relational schema, each one of them always internally mod- 
eled using a particular schema category T. Morphisms between 
pairs of these categories, i.e., functors, allow us to model permitted 
changes of these schemas in terms of the introduced set of primitive 
schema-altering operations. Analogously, category T-Inst contains 
one object for every potentially existing database instance, each one 
of them modeled as a particular instance category $\mathbf{Set}_{\mathbf{T}}$. Morphisms
(once again functors) then describe the permitted transitions be- 
tween the individual instances (where not just the actual data can 
be changed, but schemas altered as well). 


3.5 Transformations and Querying 


While the approaches by Spivak presented so far only work with
the relational model, an extended approach proposed by Liu et
al. [15] considers the multi-model scenario. Using it, we will be able
to work with multi-model data, yet only to a certain extent, because
all the individual involved models are assumed to be separate (i.e.,
there cannot exist links or other kinds of relationships between the
data in different models, which is usually not the case in real-world
multi-model databases).


Figure 9: Sample inner and extra-model transformation
(panels: relational table $\mathbf{Set}_{\mathbf{T}_1}$, relational
table $\mathbf{Set}_{\mathbf{T}_2}$, document $\mathbf{Set}_{\mathbf{D}}$)

Supposing we have a schema category $\mathbf{S}$ and an instance
category $\mathbf{Set}_{\mathbf{S}}$ for every involved logical model
in our database (relational, document, etc.), we can illustrate how
extra-model transformations could work, at least briefly, as
presented in the following example.


EXAMPLE 14. A sample transformation between tables in the relational
model and an extra-model transformation between a table and a document
is depicted in Figure 9. Both the tables and document contain the same
data (as for their actual information content), but they differ in the
structure (and, obviously, models, too).

Consider we have two instance categories for tables
$\mathbf{Set}_{\mathbf{T}_1}$ and $\mathbf{Set}_{\mathbf{T}_2}$ and a
functor $F : \mathbf{Set}_{\mathbf{T}_1} \to \mathbf{Set}_{\mathbf{T}_2}$
(describing a permitted data transformation within the relational
model). We also have an instance category for document
$\mathbf{Set}_{\mathbf{D}}$. Obviously, functor
$G : \mathbf{Set}_{\mathbf{T}_2} \to \mathbf{Set}_{\mathbf{D}}$ can be
exploited to model the intended extra-model data transformations
between the relational and document model. □
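At the level of concrete data, the two transformations of Example 14 amount to a restructuring of rows followed by a change of model. The sketch below is heavily simplified and rests on assumptions: the exact tables in Figure 9 are not recoverable here, so the first table is taken from Example 13, the inner transformation is assumed to be a projection dropping the address column, and the document is assumed to be derived from the second table. The functions `transform_f` and `transform_g` are plain data conversions that stand in for the object-level effect of the functors, not functors in the formal sense.

```python
import json

# First relational table (data from Example 13, addresses elided there).
set_t1 = [
    {"id": 1, "name": "Mary", "address": "...", "credit": 3000},
    {"id": 2, "name": "Anne", "address": "...", "credit": 2000},
    {"id": 3, "name": "John", "address": "...", "credit": 5000},
]


def transform_f(rows):
    """Inner-model step (assumed): project a table onto name and credit."""
    return [{"id": r["id"], "name": r["name"], "credit": r["credit"]}
            for r in rows]


def transform_g(rows):
    """Extra-model step (assumed): wrap table rows into one JSON document."""
    return {"customers": [{"name": r["name"], "credit": r["credit"]}
                          for r in rows]}


set_t2 = transform_f(set_t1)
document = transform_g(set_t2)
print(json.dumps(document, indent=2))
```

Applying the two steps one after the other corresponds to composing the two transformations, which is exactly what the categorical view makes explicit.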


It is easy to realize that the outlined transformations can be
applied at all the introduced levels: (1) we can work with
schema alterations of pairs of particular schema categories
$\mathbf{S}_1$ and $\mathbf{S}_2$, (2) we can study transformations
of pairs of instance categories $\mathbf{Set}_{\mathbf{S}_1}$ and
$\mathbf{Set}_{\mathbf{S}_2}$, and, last but not least, (3) we can
grasp all the possible transformations at the level of functors
between pairs of schema categories as well as pairs of instance
categories.

To conclude, the introduced concept of transformations can also 
be further used for the purpose of multi-model data querying. In par- 
ticular but without details, Liu et al. [15] proposed to use (1) objects 
to represent individual data instances (such as tables or documents), 
(2) morphisms and functors to represent filtering conditions, and 
(3) pullbacks (as a generalization of intersection and inverse image) 
to represent joins of data from distinct models. 
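The role of pullbacks as joins can be illustrated in the category of sets, where the pullback of two functions into a common set consists of all pairs on which the functions agree. This is a sketch of the general set-theoretic construction only, not of the specific proposal by Liu et al.; the order identifiers and their customer references are invented sample data.

```python
def pullback(A, B, f, g):
    """Pullback of f : A -> C and g : B -> C in the category of sets:
    all pairs (a, b) such that f(a) equals g(b)."""
    return {(a, b) for a in A for b in B if f(a) == g(b)}


# Customers from the relational model, identified by their primary keys.
customers = {1, 2, 3}
# Hypothetical orders from the document model, each referencing a customer.
orders = {"o1": 1, "o2": 1, "o3": 3}

# Join customers with their orders on the shared customer identifier.
joined = pullback(
    customers,
    orders.keys(),
    lambda c: c,            # a customer maps to its own identifier
    lambda o: orders[o],    # an order maps to the customer it references
)
print(sorted(joined))
```

Choosing the second function to be an inclusion of a subset turns the same construction into an inverse image, and choosing both to be inclusions yields an intersection.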


4 OPEN PROBLEMS 


It may seem that there exists at least one approach for each aspect 
of multi-model data modeling and representation we covered in 
this paper. However, for each one of them, there, in fact, remain 
open questions that need to be solved first. 


Conceptual Layer. Conceptual modeling approaches naturally
support multi-model data, because they intentionally hide specifics
of the individual platform-specific representations. Unfortunately,
ER is not well-standardized, and UML class diagrams conceal
important details. Moreover, a number of multi-model aspects,
including but not limited to the following ones, may need to be
resolved differently than the mentioned traditional approaches assume
(most likely because they are, nevertheless, closely related to the
relational model at the logical layer): whether each entity type must
have at least one identifier or not, whether there can be multi-valued
identifiers, whether the actual values of multi-valued attributes must
be mutually distinct, whether there can only be flat structured
attributes or attributes with arbitrary tree structures, whether such
embedded attributes may be multi-valued, whether relationship types
may have their own identifiers, whether all participating entity types
in n-ary relationship types really need to be involved in weak
identifiers or just some of them, etc. Furthermore, a fundamental
question is whether we even need to distinguish between entity types,
relationship types, and attributes. Nevertheless, these aspects need
to be discussed accordingly so that extended and adjusted approaches
that fully respect the principles of multi-model data processing can
be proposed.


Logical Layer. There exist several data structures covering the 
widely used data models based on the idea of mapping of sets of keys 
to values. However, especially the difference between the graph 
model and other models brings challenging questions regarding the 
desired natural, efficient, and unified representation. And there are 
other critical aspects, such as mapping of multi-model PSM(s) to 
PIM [34], mapping between multi-model logical and multi-model 
operational layers [6, 26], or logical schema inference from sample 
multi-model data [4, 19]. The design of the multi-model logical 
layer also strongly influences the other related layers as well as the 
mutual relations between them. 


Integrity Constraints. Integrity constraints, as they are
understood by OCL, represent a robust conceptual approach for
describing invariants and other consistency requirements the data must
conform to. Obviously, these principles may also be adapted to
the multi-model scenario. However, users must be provided with
practically exploitable tools. Otherwise, they will not be willing to
spend time creating such abstract descriptions. Another obstacle
is that most reasoning tasks are known to be undecidable when
the full expressive power of UML and OCL languages is considered.
Therefore, three main decidable fragments were proposed [25, 28]:
(1) UML only with no OCL, (2) UML with limited OCL and no
maximum cardinality constraints (OCL-Lite), and (3) UML with
limited OCL with no minimum cardinality constraints
($\mathrm{OCL}_{\mathrm{UNIV}}$). The problem is that real-world UML
schemas often use OCL together with min and max cardinalities, which
therefore represents a limitation of the existing approaches.
Fortunately, even $\mathrm{OCL}_{\mathrm{UNIV}}$ can be decidable
under certain assumptions [25].


Knowledge Reasoning. Because of the theoretical foundations 
and complexity of the knowledge representation and reasoning, the 
data model must stay simple, only involving individuals, their
concepts, and binary roles. Although it is possible to take multi-model
data and decompose it into such units with low granularity, the idea 
itself breaches the very principles of multi-model data processing, 
where the variety, complexity, and semi-structured nature would 
presumably be difficult to grasp. Therefore it is perhaps question- 
able whether it even makes sense to deploy knowledge reasoning 
approaches in the context of multi-model data. Nevertheless, one
can at least draw inspiration from the so-called OBDA (Ontology-based
Data Access) techniques [13, 29], where the goal is to integrate 
heterogeneous sources of data so that they can then be accessed at 
a higher level of abstraction through ontologies and their mutual 
mappings. A different perspective is assumed by these approaches, 
though. While OBDA deals with issues such as data quality or pro- 
cess specification, treats the data from the user perspective, and 
focuses on the business value of the data, we are rather interested 
in data representation and access patterns at the logical layer in the 
multi-model database scenario. 


Evolution Management. Evolution management is a difficult chal- 
lenge even in a single-model scenario [23]. Inter-model propagation 
of changes and data migration bring another dimension of complex- 
ity [33]. And we also need to consider propagation of changes to 
operations [6, 20, 26] or storage strategies, including data migration 
between the models [12]. The existing (single-model or aggregate- 
oriented) approaches provide first steps and promising directions, 
but a robust, generally applicable, and extensible multi-model solu- 
tion is still missing. 


Broader Generalization. Category theory represents a promising 
framework that could bring an interesting level of further gen- 
eralization. It can be used for an abstract representation of data 
models (both schemas as well as data instances), data and/or schema 
transformations, or to capture the semantics of queries and query 
rewriting. One of the advantages of category theory is that it easily 
allows us to work at different levels of abstraction. For example, 
we can have a category that models all the possible database in- 
stances within a particular model [32] and use it to describe schema 
modifications as well as data transformations like inserts, updates, 
deletes, etc. Nevertheless, the existing single-model approaches are
not mature enough and do not reflect the characteristics
of schemas and data in real-world databases.


To summarise the ideas, conceptual modeling itself, integrity con- 
straints, as well as knowledge reasoning represent approaches that 
could all be placed into the conceptual layer. ER or UML modeling 
languages, in fact, define the meaning and the very purpose of this 
layer, OCL is straightforwardly built on top of it, and description 
logics simplify the assumed data model to just atomic individuals. 
On the other hand, the purpose of particular models at the logical 
layer is to provide data structures that are indeed used for data
representation. While there are plenty of dedicated ones, both single
and multi-model, such as the traditional relational model and models
newly revisited or introduced by the family of NoSQL database systems
(key/value, wide-column, ...) or their multi-model fellows, there
are also unifying approaches, such as NoAM, associative arrays,
or tensors. Besides the actual data modeling and representation,
other tasks such as data transformation, querying, or evolution are
tightly bound to the logical layer, too.

When pursuing the outlined vision of the unified processing of 
multi-model data [10], we believe that a new inter-layer between 
the conceptual and logical ones needs to be introduced and well 
established. In a top-down manner, we can start at the conceptual 
layer with overly abstract representations and try to modify,
extend, and tailor them so that they fit the needs and specifics 
of the various multi-model scenarios. On the contrary, following 
the bottom-up direction, particular logical representations can be 
combined, extended, and uplifted to meet the expectations, too. 

Only then will truly unified multi-model data handling be
possible. It needs to be anchored by formally solid foundations,
but with particular techniques, languages and principles still user- 
friendly enough, i.e., not burdened with unmanageable complexities 
that would prevent their direct applicability and wide dissemination. 
Such a fully-fledged inter-layer would need to enable the unified 
multi-model data modeling, transformation, description of schemas, 
their inference, querying, evolution, or automated database tuning, 
to name at least a few areas. 


5 CONCLUSION 


The number of existing multi-model systems grows every day. How- 
ever, reaching a practical and widely usable level of maturity and 
robustness also requires a strong formal background and generally 
applicable solutions. The described and justified unification is also
desirable because it is unlikely, at least in the long term,
that users will be willing to cope with a multitude of specific and
often proprietary existing abstractions, languages, and techniques.
In order to accomplish the envisioned management of multi-model
data, a wide range of open questions, including but not limited to
the following ones we described, will need to be appropriately
addressed:


• Data representation: conceptual modeling of multi-model
data, generic or unifying data structures, co-existence of
multi-model and single-model scenarios, inter-model
references and embedding

• Schema design: description of multi-model schemas, integrity
constraints and their validation, data (de)normalization,
schema inference from sample data, optimal schema
decomposition between models

• Unified querying: user-friendly query language, well-defined
syntax and semantics, unified processing of multi-model
data, query rewriting from/to existing languages

• Evolution management: intra-model and inter-model schema
modification, propagation of changes to data and queries,
data migration between models

• Database tuning: autonomous model selection, integration
of new models, on-the-fly data transformation, co-existence
of replicas in distinct models, load balancing
ABSTRACT


With new emerging technologies, such as satellites and drones, 
archaeologists collect data over large areas. However, it becomes 
difficult to process such data in time. Archaeological data also have 
many different formats (images, texts, sensor data) and can be 
structured, semi-structured and unstructured. Such variety makes 
data difficult to collect, store, manage, search and analyze effectively. 
A few approaches have been proposed, but none of them covers the 
full data lifecycle nor provides an efficient data management system. 
Hence, we propose the use of a data lake to provide centralized data 
stores to host heterogeneous data, as well as tools for data quality 
checking, cleaning, transformation and analysis. In this paper, we 
propose a generic, flexible and complete data lake architecture. Our 
metadata management system exploits goldMEDAL, which is the 
most generic metadata model currently available. Finally, we detail 
the concrete implementation of this architecture dedicated to an 
archaeological project. 
1 INTRODUCTION


Over the past decade, new forms of data such as geospatial data and 
aerial photography have been included in archaeology research [8], 
leading to new challenges such as storing massive, heterogeneous 
data, high-performance data processing and data governance [4]. 
As a result, archaeologists need a platform that can host, process, 
analyze and share such data. 

In this context, a multidisciplinary consortium of archaeologists
and computer scientists proposed the HyperThesau project¹, which
aims at designing a data management and analysis platform. HyperThesau
has two main objectives: 1) the design and implementation
of an integrated platform to host, search, analyze and share archaeological
data; 2) the design of an archaeological thesaurus taking the
whole data lifecycle into account, from data creation to publication.

Classical data management solutions, i.e., databases or data warehouses,
only manage previously modeled structured data (schema-on-write
approach). However, archaeologists need to store data of all formats,
and they may only discover how to use some data over time. Hence,
we propose the use of a data lake [2], i.e., a scalable, fully integrated 
platform that can collect, store, clean, transform and analyze data 
of all types, while retaining their original formats, with no prede- 
fined structure (schema-on-read approach). Our data lake, named 
ArchaeoDAL, provides centralized storage for heterogeneous data 
and data quality checking, cleaning, transformation and analysis 
tools. Moreover, by including machine learning frameworks into 
ArchaeoDAL, we can achieve descriptive and predictive analyses. 

Many existing data lake solutions provide architecture and/or 
implementation, but few include a metadata management system, 
which is nevertheless essential to avoid building a so-called data 
swamp, i.e., an unexploitable data lake [6, 12]. Moreover, none of
the existing metadata management systems provides all the
metadata features we need. For example, in archaeology,
thesauri are often used for organizing and searching data. There- 
fore, the metadata system must allow users to define one or more 
thesauri, associate data with specific terms and create relations 
between terms, e.g., synonyms and antonyms. Thus, we conclude 
that existing data lake architectures, including metadata systems, 
are not generic, flexible and complete enough for our purpose. 

To address these problems, we propose in this paper a generic,
flexible and complete data lake architecture. Moreover, our metadata
model exploits and enriches goldMEDAL, which is the most generic
metadata model currently available [13]. To illustrate the flexibility
and completeness of ArchaeoDAL’s architecture, we provide
a concrete implementation dedicated to the HyperThesau project.
With a fully integrated metadata management and security system,
we can not only ensure data security, but also track all data
transformations.

¹ https://imu.universite-lyon.fr/projet/hypertheseau-hyper-thesaurus-et-lacs-de-donnees-fouiller-la-ville-et-ses-archives-archeologiques-2018/

The remainder of this paper is organized as follows. In Section 2,
we review and discuss existing data lake architectures. In Section 3,
we present ArchaeoDAL’s abstract architecture, implementation
and deployment. In Section 4, we present two archaeological application
examples. Finally, in Section 5, we conclude this paper and
outline future work.


2 DATA LAKE ARCHITECTURES 


The concept of data lake was first introduced by Dixon [2] in association
with the Hadoop file system, which can host large heterogeneous
data sets without any predefined schema. The data
lake concept was then quickly adopted [3, 10]. With the growing popularity
of data lakes, many solutions were proposed. After studying
them, we divide data lake architectures into two categories: 1) data
storage-centric architectures; 2) data storage and processing-centric
architectures.


2.1 Data Storage-Centric Architectures 


In the early days, a data lake was viewed as a central, physical 
storage repository for any type of raw data, aiming for future insight. 
In this line, Inmon proposes an architecture that organizes data 
by formats, in so-called data ponds [6]. The raw data pond is the 
place where data first enters the lake. The analog data pond stores 
data generated by sensors or machines. The application data pond 
stores data generated by applications. Finally, the textual data pond 
stores unstructured, textual data. 

Based on such zone solutions, Gorelik proposes that a common 
data lake architecture includes four zones [5]: a landing zone that 
hosts raw ingested data; a gold zone that hosts cleansed and en- 
riched data; a work zone that hosts transformed, structured data 
for analysis; and a sensitive zone that hosts confidential data. Bird 
also proposes a similar architecture [1]. Such architectures organize 
data with respect to how deeply data are processed and security 
levels. 

The advantage of storage-centric architectures is that they pro- 
vide a way to organize data inside a data lake by default. However, 
the predefined data organization may not satisfy the requirements 
of all projects. For example, HyperThesau needs to store data from 
different research entities. Thus, one requirement is to organize 
data by research entities first. Yet, the bigger problem of storage-centric
architectures is that they omit important parts of a data
lake, such as data processing and metadata management.


2.2 Data Storage and Processing-Centric 
Architectures 


As data lakes evolved, they came to be viewed as platforms,
resulting in more complete architectures. Alrehamy and
Walker propose a “Personal Data Lake” architecture that consists of
five components [15] and addresses data ingestion and metadata
management issues. However, it transforms data into a special JSON
object that stores both data and metadata. By changing the original
data format, this solution contradicts the data lake philosophy of
conserving original data formats.

Pankaj and Tomcy propose a data lake architecture based on the 
Lambda architecture (Figure 1) that covers almost all key stages of 
the data life cycle [14]. However, it omits metadata management 
and security issues. Moreover, not all data lakes need near real-time 
data processing capacities. 


[Figure 1 diagram; layer labels: Serving Layer, Speed Layer, Batch Layer, Messaging Layer, Lambda Layer, Data Storage Layer]

Figure 1: Lambda data lake architecture [14]
Mehmood et al. propose an interesting architecture consisting
of four layers: a data ingestion layer that acquires data for storage;
a data storage layer; a data exploration and analysis layer; and a data
visualization layer [9]. This architecture is close to our definition of a
data lake, i.e., a fully integrated platform to collect, store, transform
and analyze data for knowledge extraction. Moreover, Mehmood
et al. provide an implementation of their architecture. However,
although they mention the importance of metadata management,
they do not include a metadata system in their architecture. In addition,
data security is not addressed and data visualization is the only
proposed analysis method. Raju et al. propose an architecture that
is similar to that of Mehmood et al. [11]. They essentially use a different
tool set to implement their approach and also leave out metadata
management and data security.


2.3 Discussion 


In our opinion, a data lake architecture must be generic, flexible
and complete. Genericity implies that the architecture must not rely
on any specific tools or frameworks. Flexibility means that users
must be able to define their own ways of organizing data. Completeness
means that not only functional features (e.g., ingestion,
storage and analysis) must be handled, but also non-functional
features (e.g., data governance and data security). Table 1 provides
an evaluation of seven data lake architectures with respect to these
three properties.

The solutions by Mehmood et al. and Raju et al. are not generic be- 
cause their architecture heavily relies on certain tools. The zone ar- 
chitectures by Inmon, Gorelik and Bird are not flexible, because they 
force a specific data organization. Finally, Alrehamy and Walker’s 
platform is the only complete architecture that addresses data gov- 
ernance and security, but is not a canonical data lake. 
Table 1: Comparison of data lake architectures


Architecture Generic Flexible Complete 
Alrehamy and Walker (2015) J NA 
Inmon (2016) J 

Pankaj and Tomcy (2017) NA J 

Raju et al. (2018) NA 

Bird (2019) J 

Gorelik (2019) NA 

Mehmood et al. (2019) J 


3 ARCHAEODAL’S ARCHITECTURE, 
IMPLEMENTATION AND DEPLOYMENT 


In this section, we propose a generic, flexible abstract data lake
architecture that covers the full data lifecycle (Figure 2) and contains
eleven layers. The orange layers (from layer 1 to layer 6) cover
the full data lifecycle. After data processing in these layers, data
become clearer and easier to use for end-users. The yellow layers
cover non-functional requirements. After the definition of each
layer, we present how it is implemented in our current
ArchaeoDAL instance. As this instance is dedicated to the HyperThesau
project, it does not cover all the features of the abstract
architecture. For example, real-time data ingestion and processing
are not implemented. However, real-time or near real-time data
ingestion and processing features can be achieved by adding tools
such as Apache Storm² or Apache Kafka³ in the data ingestion and
data insights layers.


[Figure 2 diagram; legible layer labels include 1. Data Source, 3. Data Storage, 5. Data Insights and 6. Application]

Figure 2: ArchaeoDAL’s architecture


3.1 Data Source Layer 


In the data source layer, we gather the basic properties of data
sources, e.g., volume, format, velocity, connectivity, etc. Based on
these properties, data engineers can determine how to import data
into the lake. If metadata are required, data engineers must also
find the best fitting metadata model to govern input data.

In our instance of ArchaeoDAL, we have data sources such as
relational databases and various files stored in the archaeologists’
personal computers.

² https://storm.apache.org/
³ https://kafka.apache.org/


3.2 Data Ingestion Layer 


The data ingestion layer provides a set of tools that allow users 
to perform batch or real-time data ingestion. Based on the data 
source properties that are gathered in the data source layer, data 
engineers can choose the right tools and plans to ingest data into 
the lake. They must also consider the capacity of the data lake 
to avoid data loss, especially for real-time data ingestion. During 
ingestion, metadata provided by the data sources, e.g., the name 
of excavation sites or instruments, must be gathered as much as 
possible. After data are loaded into the lake, we may lose track 
of data sources. It is indeed more difficult to gather this kind of 
metadata without knowledge about data sources. 

ArchaeoDAL’s implementation exploits Apache Sqoop⁴ to ingest
structured data and Apache Flume⁵ to ingest semi-structured and
unstructured data. Apache Sqoop can efficiently transfer bulk data 
from structured data stores such as relational databases. Apache 
Flume is a distributed service for efficiently collecting, aggregating 
and moving large amounts of data. For one-time data ingestion, 
our instance provides Sqoop scripts and a web interface to ingest 
bulk data. For repeated data loading, we developed Flume agents 
to achieve automated data ingestion. 


3.3 Data Storage Layer 


The data storage layer is the core layer of a data lake. It must have 
the capacity to store all data, e.g., structured, semi-structured and 
unstructured data, in any format. 

ArchaeoDAL’s implementation uses the Hadoop Distributed File
System⁶ (HDFS) to store ingested data, because HDFS stores data
on commodity machines and provides horizontal scalability and
fault tolerance. As a result, we do not need to build a large cluster
up front; we simply add nodes when data volume grows. To better
support the storage of structured and semi-structured data, we add
two tools: Apache Hive⁷ to store data with explicit data structures,
and Apache HBase⁸, a distributed, versioned, column-oriented database
that provides faster retrieval of semi-structured data.


3.4 Data Distillation Layer 


The data distillation layer provides a set of tools for data cleaning
and encoding formalization. Data cleaning refers to eliminating
errors such as duplicates and type violations, e.g., a numeric column
that contains non-numeric values. Data encoding formalization refers
to converting data in various character encodings, e.g., ASCII, ISO-8859-15
or Latin-1, into a unified encoding, e.g., UTF-8, which
covers all language symbols and graphic characters.
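
As a minimal illustration of encoding formalization, the sketch below re-encodes bytes from a declared source encoding into UTF-8 with the Python standard library only. The function name `to_utf8` and the sample string are hypothetical; they are not taken from ArchaeoDAL, and the sketch assumes the source encoding of each file is already known.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    # errors="strict" makes undecodable bytes raise instead of being silently altered
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# "céramique" stored as ISO-8859-15: the "é" takes a single byte (0xE9)
latin9_bytes = "céramique".encode("iso-8859-15")
utf8_bytes = to_utf8(latin9_bytes, "iso-8859-15")
```

Decoding strictly is a deliberate choice in this sketch: a file whose declared encoding is wrong fails loudly rather than entering the lake with corrupted characters.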


⁴ https://sqoop.apache.org/
⁵ https://flume.apache.org/
⁶ https://hadoop.apache.org/
⁷ https://hive.apache.org/
⁸ https://hbase.apache.org/

ArchaeoDAL’s implementation uses Apache Spark⁹ to clean and
transform data. We developed a set of Spark programs that can
detect duplicates, NULL values and type violations. Based on the
percentage of detected errors, we can give an indication of data quality.
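
The three checks named above can be sketched in plain Python as follows. This is only a stand-in for illustration, not the Spark programs themselves: the function `quality_report`, the column name `weight_g` and the sample rows are all hypothetical, and the error rate shown is just one possible way to turn error counts into a quality hint.

```python
from collections import Counter


def quality_report(rows, numeric_columns):
    """Count duplicates, NULLs and type violations in a list of dict records."""
    # A row is a duplicate if an identical row appeared earlier
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())

    nulls = 0
    type_violations = 0
    for row in rows:
        for column, value in row.items():
            if value is None:
                nulls += 1
            elif column in numeric_columns:
                try:
                    float(value)
                except (TypeError, ValueError):
                    type_violations += 1  # non-numeric value in a numeric column

    cells = sum(len(row) for row in rows)
    return {
        "rows": len(rows),
        "duplicates": duplicates,
        "nulls": nulls,
        "type_violations": type_violations,
        # Share of cells that are NULL or violate the expected type
        "cell_error_rate": (nulls + type_violations) / cells if cells else 0.0,
    }


sample = [
    {"id": "1", "weight_g": "12.5"},
    {"id": "2", "weight_g": "heavy"},  # type violation
    {"id": "3", "weight_g": None},     # NULL value
    {"id": "1", "weight_g": "12.5"},   # duplicate of the first row
]
report = quality_report(sample, numeric_columns={"weight_g"})
```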


3.5 Data Insights Layer 


The data insights layer provides a set of tools for data transfor- 
mation and exploratory analysis. Data transformation refers to 
transforming data from one or diverse sources into specific mod- 
els that are ready to use for data application, e.g., reporting and 
visualization. Exploratory data analysis refers to the process of 
performing initial investigations on data to discover patterns, test 
hypotheses, eliminate meaningless columns, etc. Transformed data 
may also be persisted in the data storage layer for later reuse. 

ArchaeoDAL’s implementation resorts to Apache Spark to per- 
form data transformation and exploratory data analysis. Spark also 
provides machine learning libraries that allow developers to per- 
form more sophisticated exploratory data analyses. 


3.6 Data Application Layer 


The data application layer provides applications that allow users to 
extract value from data. For example, a data lake may provide an 
interactive query system to do descriptive and predictive analytics. 
It may also provide tools to produce reports and visualize data. 

In ArchaeoDAL’s implementation, we use a Web-based notebook,
Apache Zeppelin¹⁰, as the front end. Zeppelin connects to
the data analytics engine Apache Spark that can run a complex
directed acyclic graph of tasks for processing data. Our notebook
interface supports various languages and their associated Application
Programming Interfaces (APIs), e.g., R, Python, Java, Scala
and SQL. It provides a default data visualization system that can be
enriched by Python or R libraries.

ArchaeoDAL also provides a web interface that helps users down- 
load, upload or delete data. 


3.7 Data Governance Layer


The data governance layer provides a set of tools to establish and 
execute plans and programs for data quality control [7]. This layer 
is closely linked to the data storage, ingestion, distillation, insights 
and application layers to capture all relevant metadata. A key com- 
ponent of the data governance layer is a metadata model [12]. 


3.7.1 Metadata Model. In ArchaeoDAL, we adopt goldMEDAL [13],
which is modeled at the conceptual (formal description), logical
(graph model) and physical (various implementations) levels. goldMEDAL
features four main metadata concepts (Figure 3): 1) data entities, i.e.,
basic data units such as spreadsheet tables or textual documents;
2) groupings that bring together data entities w.r.t. common properties
in groups; 3) links that associate either data entities or groups
with each other; and 4) processes, i.e., transformations applied to
data entities that produce new data entities. All concepts bear metadata,
which makes goldMEDAL the most generic metadata model in
the literature, to the best of our knowledge.


⁹ https://spark.apache.org/
¹⁰ https://zeppelin.apache.org/

Figure 3: goldMEDAL’s concepts [13]


However, we encountered a problem of terminology variation
when creating metadata. Indeed, goldMEDAL does not provide explicit
guidance for metadata creation. This can lead to consistency
and efficiency problems. For example, when users create data entities,
they have their own way of defining attributes, i.e., key/value
pairs that describe the basic properties of data. Without a universal
guideline or template, every user can invent their own way: the number,
names and types of attributes can differ. As a result,
it may become quite difficult to retrieve or search metadata.

Thus, we enrich goldMEDAL with a new concept, data entity 
type, which explicitly defines the number, names and types of the 
attributes in a data entity. 

All data entity types form a type system that specifies how meta- 
data describe data inside the lake. Since each and every data lake 
has specific requirements on how to represent data to fulfill domain- 
specific requirements, metadata must contain adequate attributes. 
Thus, we need to design a domain-specific type system for each 
domain. For example, in the HyperThesau project, users need not 
only semantic metadata to understand the content of data, but also 
geographical metadata to know where archaeological objects are 
discovered. As a result, the type system is quite different from other 
domains. 

In sum, the benefits of having a data entity type system include:
1) consistency: a universal definition of metadata avoids terminology
variations that may cause data retrieval problems; 2) flexibility:
a domain-specific type system helps define specific metadata for
the requirements of each use case; 3) efficiency: with a given metadata
type system, it is easy to write and implement search queries.
Because we know in advance the names and types of all metadata
attributes, we can filter data with metadata predicates such as
upload_date > 10/02/2016.
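
To make the idea of a data entity type more concrete, here is a minimal sketch under stated assumptions. The type `EXCAVATION_REPORT_TYPE`, its attributes, the `validate` helper and the catalog entries are hypothetical and do not come from goldMEDAL or ArchaeoDAL; the sample date in the text is read as 10 February 2016, which is also an assumption.

```python
from datetime import date

# Hypothetical data entity type: attribute name -> expected Python type
EXCAVATION_REPORT_TYPE = {
    "name": str,
    "site": str,
    "upload_date": date,
}


def validate(entity: dict, entity_type: dict) -> list:
    """Return a list of problems; an empty list means the entity conforms."""
    problems = []
    for attribute, expected in entity_type.items():
        if attribute not in entity:
            problems.append(f"missing attribute: {attribute}")
        elif not isinstance(entity[attribute], expected):
            problems.append(f"wrong type for attribute: {attribute}")
    for attribute in entity:
        if attribute not in entity_type:
            problems.append(f"unexpected attribute: {attribute}")
    return problems


catalog = [
    {"name": "report-a.pdf", "site": "Bibracte", "upload_date": date(2015, 6, 1)},
    {"name": "report-b.pdf", "site": "Bibracte", "upload_date": date(2019, 11, 12)},
]

# Every entity shares the same attribute names and types,
# so a metadata predicate can be written as a single comparison.
recent = [e["name"] for e in catalog if e["upload_date"] > date(2016, 2, 10)]
```

Because the type fixes the number, names and types of attributes, the filter never has to guess whether a date is stored under `upload_date`, `uploaded` or `date`, nor whether it is a string or a date.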


3.7.2 Thesaurus Modeling. Although thesauri, ontologies and tax- 
onomies can definitely be modeled with goldMEDAL, its specifica- 
tions do not provide details on how to conceptually and logically 
model such semantic resources, while we especially need to manage 
thesauri in the HyperThesau project. 


ArchaeoDAL: A Data Lake for Archaeological Data Management and Analytics 


A thesaurus consists of a set of categories and terms that help
regroup data. A category has at most one parent. A
category without a parent is called the root category. A category
may have one or more children. A child of a category is either a
subcategory or a term. A term is a special type of category that has
no child but must have a parent category. A term may have relationships
with other terms (related words, synonyms, antonyms,
etc.).

Fortunately, categories and terms can easily be modeled as data 
entities and structured with labeled links, with labels defined as link 
metadata. It is also easy to extend this thesaurus model to represent 
ontologies or taxonomies. 
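
The rules above can be sketched with entities and labeled links as follows. This is an illustrative model only: the class names, the link label `child_of` and the sample vocabulary are hypothetical and do not reproduce goldMEDAL’s or Atlas’ actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A category or a term, stored as a data entity."""
    name: str
    kind: str  # "category" or "term"


@dataclass
class Thesaurus:
    entities: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (source, label, target) triples

    def add(self, name, kind, parent=None):
        if kind == "term" and parent is None:
            raise ValueError("a term must have a parent category")
        if parent is not None and self.entities[parent].kind != "category":
            raise ValueError("only a category can have children")
        self.entities[name] = Entity(name, kind)
        if parent is not None:
            # Each entity gets at most one child_of link, i.e., at most one parent
            self.links.append((name, "child_of", parent))

    def relate(self, term_a, label, term_b):
        # The label (e.g., "synonym") is carried as link metadata
        self.links.append((term_a, label, term_b))

    def children(self, category):
        return [s for s, label, t in self.links if label == "child_of" and t == category]


thesaurus = Thesaurus()
thesaurus.add("objects", "category")  # root category: no parent
thesaurus.add("ceramics", "category", parent="objects")
thesaurus.add("amphora", "term", parent="ceramics")
thesaurus.add("jar", "term", parent="ceramics")
thesaurus.relate("amphora", "related", "jar")
```

Keeping the hierarchy and the term relationships in the same list of labeled links is what makes the extension to taxonomies or ontologies straightforward: new relationship kinds only require new labels.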


3.7.3 Data Governance Implementation. The HyperThesau project
does not require sophisticated data governance tools to fix decision
domains and decide who makes decisions to ensure effective data
management [7]. Thus, ArchaeoDAL’s data governance layer implementation
only focuses on how to use metadata to govern data
inside the lake. We use Apache Atlas¹¹, which is a data governance
and metadata management framework, to implement our extended
version of goldMEDAL (Section 3.7.1). With these metadata, we
build a data catalog that allows searching and filtering data through
different metadata attributes, organizing data with user-defined classifications
and the thesaurus, and tracing data lineage.


3.8 Data Security Layer 


The data security layer provides tools to ensure data security. It 
should ensure the user’s authenticity, data confidentiality and in- 
tegrity. 

ArchaeoDAL’s implementation orchestrates more than twenty
tools and frameworks, most of which have their own authentication
system. If we used their default authentication systems, a given
user could have twenty login and password pairs. Even if a user
decided to use the same login and password for all services, s/he
would have to change them twenty times whenever needed. To avoid this,
we have deployed an OpenLDAP server as a centralized authentication
server. All services connect to the centralized authentication
server to check user login and password. The access control system
consists of two parts. First, we need to control the access to each
service. For example, when a user wants to create a new thesaurus,
Atlas needs to check whether this user has the right to do so. Second, we
need to control the access to data in the lake. A user may access data
via different tools and the authorization answer of these tools must
be uniform. We use Apache Ranger¹² to set security policy rules for
each service. For data access control, we implement a role-based
access control (RBAC) system by using the Hadoop group mapping
service.

¹¹ https://atlas.apache.org/
¹² https://ranger.apache.org/

Data security is enforced by combining the three systems. For
example, suppose a user wants to view data from a table stored in Hive.
S/he uses his/her login and password to connect to the Zeppelin
notebook. This login and password are checked by the OpenLDAP
server. After login, s/he runs a SQL query that is then submitted to
Hive. Before query execution, Hive sends an authorization request
to Ranger. If Ranger allows the user to run the query, the query starts and
retrieves data with the associated user credentials. Then, HDFS
checks whether the associated user credentials have the right to
access data. If the user does not have the right to read data, an access
denied exception is produced.
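
The chain of checks in this example can be sketched as three successive guards. The dictionaries below are hypothetical in-memory stand-ins for the roles played by OpenLDAP, Ranger and the Hadoop group mapping service; they do not use any of those systems’ APIs, and the plain-text password is for illustration only.

```python
class AccessDenied(Exception):
    """Raised when any step of the chain refuses the request."""


# Hypothetical stand-ins for the three systems
LDAP_USERS = {"alice": "s3cret"}                  # login -> password (directory role)
SERVICE_POLICIES = {("alice", "hive", "select")}  # allowed (user, service, action) triples
USER_GROUPS = {"alice": {"artefacts"}}            # user -> groups (group mapping role)
PATH_GROUPS = {
    "/HyperThesau/Artefacts": "artefacts",
    "/HyperThesau/Bibracte": "bibracte",
}


def read_table(login, password, path):
    # 1) Authentication against the central directory
    if login not in LDAP_USERS or LDAP_USERS[login] != password:
        raise AccessDenied("authentication failed")
    # 2) Service-level authorization before the query runs
    if (login, "hive", "select") not in SERVICE_POLICIES:
        raise AccessDenied("service policy refused the query")
    # 3) Storage-level check with the same user identity
    if PATH_GROUPS[path] not in USER_GROUPS.get(login, set()):
        raise AccessDenied("no read permission on " + path)
    return "rows from " + path
```

In this sketch, `read_table("alice", "s3cret", "/HyperThesau/Artefacts")` succeeds, while the same user asking for `/HyperThesau/Bibracte` passes the first two guards and is refused at the storage level.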


3.9 Workflow Manager Layer 


The workflow manager layer provides tools to automate the flow 
of data processes. This layer is optional. 

For now, we have not identified any repeated data processing
tasks in the HyperThesau project. As a result, we have not implemented
this layer. However, it can easily be implemented by
integrating tools such as Apache Airflow¹³ or Apache NiFi¹⁴.


3.10 Resource Manager Layer 


As data lakes may involve many servers working together, the 
resource manager layer is responsible for negotiating resources and 
scheduling tasks on each server. This layer is optional, because a 
reasonably small data lake may be implemented on one server only. 

As ArchaeoDAL rests on a distributed storage and computation
framework, a resource manager layer is mandatory. We use YARN¹⁵,
since this resource manager can not only manage the resources in
a cluster, but also schedule jobs.


3.11 Communication Layer 


The communication layer provides tools that allow other layers, 
e.g., data application, data security and data governance, to com- 
municate with each other. It must provide both synchronous and 
asynchronous communication capability. For example, a data trans- 
formation generates new metadata that are registered by the data 
governance layer. Metadata registration should not block the data 
transformation process. Therefore, the data insights and governance 
layers require asynchronous communication. However, when a user 
wants to visualize data, the data application and security layers 
require synchronous communication, since it is not desirable to 
have a user read data before the security layer authorizes the access. 

ArchaeoDAL’s implementation uses Apache Kafka, which provides
a unified, high-throughput, low-latency platform for handling
real-time data feeds. Kafka can connect to many frameworks
and tools by default, e.g., Sqoop, Flume, Hive and Spark. It also provides
both synchronous and asynchronous communication.
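
The difference between the two communication modes can be sketched with the Python standard library. The queue below is an in-process stand-in for a message broker and does not use Kafka’s API; the function names and the single authorized user are hypothetical.

```python
import queue
import threading

metadata_events = queue.Queue()
registered = []


def governance_worker():
    # Consumes metadata events independently of the producer
    while True:
        event = metadata_events.get()
        if event is None:  # sentinel: stop the worker
            break
        registered.append(event)


def transform(data):
    result = [value.upper() for value in data]
    # Asynchronous: publish the event and return without waiting for registration
    metadata_events.put({"process": "upper", "rows": len(result)})
    return result


def authorize(user):
    # Synchronous: the caller blocks until the answer is known
    return user == "alice"


def visualize(user, data):
    if not authorize(user):
        raise PermissionError("access denied")
    return data


worker = threading.Thread(target=governance_worker)
worker.start()
output = transform(["a", "b"])
metadata_events.put(None)
worker.join()
```

Here `transform` never waits for the governance side, whereas `visualize` cannot return data before `authorize` has answered, which mirrors the two requirements stated above.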


3.12 ArchaeoDAL’s Deployment 


We have deployed ArchaeoDAL’s implementation in a self-hosted
cloud. The current platform is a cluster containing 6 virtual machines,
each having 4 virtual cores, 16 GB of RAM and 1 TB of
disk space. We use Ambari¹⁶ to manage and monitor the virtual
machines and installed tools. The current platform allows users to
ingest, store, clean, transform, analyze, visualize and share data by
using different tools and frameworks.

ArchaeoDAL already hosts the data of two archaeological research
facilities, i.e., Bibracte¹⁷ and Artefacts¹⁸. Artefacts currently
amounts to 20,475 records and 180,478 inventoried archaeological
objects. Bibracte currently contains 114 excavation site reports that
contain 30,106 excavation units and 83,328 inventoried objects. We
have imported a thesaurus developed by our partner researchers
from the Maison de l’Orient et de la Méditerranée¹⁹, who study ancient
societies in all their aspects, from prehistory to the medieval
world, in the Mediterranean countries and the Near and Middle East
territories. This thesaurus implements the ISO 25964 standard. A dedicated
thesaurus by the linguistic expert of the HyperThesau project
is also in preparation. With the help of our metadata management
system, users can associate data with the imported thesaurus, and
our search engine allows users to search and filter data based on
the thesaurus.

¹³ https://airflow.apache.org/
¹⁴ https://nifi.apache.org/
¹⁵ https://yarnpkg.com/
¹⁶ https://ambari.apache.org/
¹⁷ http://www.bibracte.fr/
¹⁸ https://artefacts.mom.fr/


4 APPLICATION EXAMPLES 


In this section, we illustrate how ArchaeoDAL supports users 
throughout the archaeological data lifecycle, via metadata. We also 
demonstrate the flexibility and completeness of ArchaeoDAL (Sec- 
tion 2.3). 


4.1 Heterogeneous Archaeological Data 
Analysis 


In this first example, we show how to analyze heterogeneous ar- 
chaeological data, how to generate relevant metadata during data 
ingestion and transformation, and how to organize data flexibly via 
the metadata management system. 

The Artefacts dataset consists of a SQL database of 32 tables and a
set of files that store detailed object descriptions as semi-structured
data. This dataset inventories 180,478 objects.

The metadata management system is implemented with Apache Atlas
(Section 3.7.3) and provides three ways to ingest metadata: 1)
pre-coded Atlas hooks (scripts), 2) self-developed Atlas hooks and 3)
the REpresentational State Transfer (REST) API.


4.1.1 Structured Data Ingestion and Metadata Generation. To im- 
port data from a SQL database, we use a hook dedicated to Sqoop 
that can generate the metadata of the imported data and ingest them 
automatically. A new database is created in ArchaeoDAL and its 
metadata is generated and inserted into the Atlas instance (Figure 4). 
Figure 5 shows an example of table metadata inside Artefacts. 


4.1.2 Semi-structured Data Ingestion and Metadata Generation. We 
provide three ways to ingest semi-structured data. The simplest 
way is to use Ambari’s Web interface (Figure 6). The second way is 
to use the HDFS command-line client. 

The third way is to use a data ingestion tool. ArchaeoDAL relies on
Apache Flume for this purpose. A Flume agent can monitor any file
system of any computer. When a file is created on the monitored
file system, the Flume agent uploads it to ArchaeoDAL automatically.

Now that we have uploaded the required files into ArchaeoDAL,
we need to generate and insert the metadata into Atlas. Since Atlas
does not provide a hook for HDFS, we developed our own²⁰. This
hook is triggered by the HDFS create, update and delete events. For
example, the upload action generates a file creation event in HDFS,
which in turn triggers our metadata generation hook (Listing 1).
After the hook has uploaded metadata into Atlas, they can be visualized
(Figure 7).

¹⁹ https://www.mom.fr/
²⁰ https://github.com/pengfei99/Atlas HDFSHook


{
  "entities": [
    {
      "typeName": "hdfs_path",
      "createdBy": "pliu",
      "attributes": {
        "qualifiedName": "hdfs://lin02.udl.org:9000/HyperThesau/Artefacts/object-168.txt",
        "name": "object-168.txt",
        "path": "hdfs://lin02.udl.org:9000/HyperThesau/Artefacts",
        "user": "pliu",
        "group": "artefacts",
        "creation_time": "2020-12-29",
        "owner": "pliu",
        "numberOfReplicas": 0,
        "fileSize": 36763,
        "isFile": true
      }
    }
  ]
}

Listing 1: Sample generated metadata from an HDFS file
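
As a rough sketch of what a metadata generation hook has to do, the function below derives most attributes of an hdfs_path entity shaped like Listing 1 from basic file information. The function name and its parameters are hypothetical; this is not the code of the authors’ hook, and it only builds the dictionary without sending it to Atlas.

```python
import posixpath


def build_hdfs_path_entity(namenode, file_path, user, group, creation_time, file_size):
    """Build an hdfs_path entity dictionary from basic file information."""
    # HDFS paths use forward slashes, hence posixpath rather than os.path
    directory, name = posixpath.split(file_path)
    return {
        "typeName": "hdfs_path",
        "createdBy": user,
        "attributes": {
            "qualifiedName": namenode + file_path,
            "name": name,
            "path": namenode + directory,
            "user": user,
            "group": group,
            "creation_time": creation_time,
            "owner": user,
            "fileSize": file_size,
            "isFile": True,
        },
    }


entity = build_hdfs_path_entity(
    namenode="hdfs://lin02.udl.org:9000",
    file_path="/HyperThesau/Artefacts/object-168.txt",
    user="pliu",
    group="artefacts",
    creation_time="2020-12-29",
    file_size=36763,
)
```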


We also developed an Atlas API in Python²¹ to allow data engineers
to generate and insert metadata into Atlas more easily. As
Amazon S3 is a widely used cloud storage service, we also developed
an Atlas S3 hook²².


4.1.3 Data Transformation and Metadata Generation for Data Lineage
Tracing. Once Artefacts data are ingested into ArchaeoDAL,
let us imagine that an archaeologist wants to link the detailed description
of objects (stored in table objects) with their discovery and
storage locations. The first step is to convert the semi-structured object
descriptions into structured data. We developed a simple Spark
Extract, Transform and Load (ETL) script for this purpose. Then, we
save the output structured data into Hive tables location (discovery
location) and musee (the museum where objects are stored).

Finally, we join the three tables into a new table called objects_origin
that contains the objects’ descriptions and their discovery
and storage locations.

Thereafter, objects_origin’s metadata can be gathered into Atlas
with the help of the default Hive hook²³ and a Spark hook developed
by Hortonworks²⁴. All Hive and Spark data transformations are
tracked and all relevant metadata are pushed automatically into
Atlas. Figures 8 and 9 show table objects_origin’s metadata and
lineage, respectively.
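
Lineage tracing boils down to recording, for each process, its inputs and its output, and then walking these records backwards. The sketch below illustrates this with the tables of this example; the process names and the entity name `object_descriptions` (standing for the semi-structured files) are hypothetical, and the code does not reflect how Atlas stores lineage internally.

```python
# Each process records its inputs and its output, as a lineage-tracking hook would
processes = [
    {"name": "spark_etl", "inputs": ["object_descriptions"], "output": "location"},
    {"name": "spark_etl", "inputs": ["object_descriptions"], "output": "musee"},
    {"name": "hive_join", "inputs": ["objects", "location", "musee"], "output": "objects_origin"},
]


def upstream(entity, processes):
    """Return every entity that the given entity was derived from, directly or not."""
    found = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for process in processes:
            if process["output"] == current:
                for source in process["inputs"]:
                    if source not in found:
                        found.add(source)
                        stack.append(source)
    return found


sources = upstream("objects_origin", processes)
```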


4.1.4 Flexible Data Organization. As we mentioned in Section 2.3,
existing data lake solutions do not allow users to define their own
ways of organizing data, whereas ArchaeoDAL should. Moreover,
ArchaeoDAL must allow multiple data organizations to coexist.
For example, let us define four different ways to organize data:
1) by maturity, e.g., raw data vs. enriched data; 2) by provenance,
e.g., Artefacts and Bibracte; 3) by confidentiality level, e.g., strictly
confidential, restricted or public; and 4) by year of creation.


²¹ https://pypi.org/project/atlaspyapi/
²² https://pypi.org/project/atlass3hook/
²³ https://atlas.apache.org/1.2.0/Hook-Hive.html
²⁴ https://github.com/hortonworks-spark/spark-atlas-connector
Figure 4: Database metadata

[Figure 5 screenshot: Atlas view of Hive table location (hive_table) in database artefacts, classified as artefacts, with columns including id_pays, id_departement, id_commune and id_lieu_dits; comment: Imported by sqoop on 2019/11/20 18:10:10]

Figure 5: Table metadata


This is achieved through Atlas’ classifications, which can group
data of the same nature. Moreover, data can be associated with
multiple classifications, i.e., be in different data groups at the same
time. Figure 10 shows the implementation of the above data organization
in Atlas. We associate table objects_origin with four classifications
(i.e., enriched, Artefacts, confidential, 2020). Since table
objects_origin belongs to four classifications, we can filter data by
maturity by clicking on the enriched classification to find the table.
In sum, classifications allow users to organize data easily and flexibly.
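The many-to-many relation between data entities and classifications can be sketched with a toy in-memory index. This is not the Atlas API: the class ClassificationIndex and its methods are hypothetical, and the second entity and its "raw" label are assumptions added only to make the example non-trivial.

```python
from collections import defaultdict


class ClassificationIndex:
    """Toy many-to-many index between classifications and data entities."""

    def __init__(self):
        self._by_classification = defaultdict(set)
        self._by_entity = defaultdict(set)

    def classify(self, entity, *classifications):
        """Attach one or more classifications to a data entity."""
        for name in classifications:
            self._by_classification[name].add(entity)
            self._by_entity[entity].add(name)

    def entities(self, classification):
        """All entities carrying the given classification, sorted."""
        return sorted(self._by_classification.get(classification, set()))

    def classifications(self, entity):
        """All classifications attached to the given entity, sorted."""
        return sorted(self._by_entity.get(entity, set()))


index = ClassificationIndex()
# The four classifications of objects_origin come from the text.
index.classify("objects_origin", "enriched", "Artefacts", "confidential", "2020")
# Hypothetical second entity, to show that organizations can overlap.
index.classify("object-168.txt", "raw", "Artefacts")

by_maturity = index.entities("enriched")
by_provenance = index.entities("Artefacts")
```

Filtering by maturity and filtering by provenance are then two lookups in the same index, which is the coexistence of organizations described above.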


4.2 Data Indexing and Search through thesauri 


As mentioned in Section 3.7.2, thesauri are important metadata
for project HyperThesau. As an example, we import a thesaurus
provided by the archaeological research facility called Artefacts into
ArchaeoDAL. This thesaurus is mainly used to index the inventoried
archaeological objects of Artefacts. Its basic building blocks are
terms that can be grouped by categories (Figure 11).


Figure 6: Data upload interface


Figure 7: Sample metadata visualization in Atlas

We can associate any data entity with any term. A data entity
can be associated with multiple terms. After we index a data entity
with a term, we can search data by using the terms of a thesaurus.
For example, we have a database table called bibliographie and a
file called 204docannexe.csv that contains information about shields.
Suppose we need to associate these two data entities with the term
bouclier (shield in French). After indexing, we can click on the term
bouclier to find all data associated with this term (Figure 12).


4.2.1 Data Indexing with Multiple thesauri. One of the biggest
challenges of project HyperThesau is that each research facility uses
its own thesaurus. Moreover, there is no standard thesaurus. Thus,
if we index data with one given thesaurus, archaeologists using
another one cannot use ArchaeoDAL. To overcome this challenge,
ArchaeoDAL supports multiple thesauri. Moreover, we can define
relations, e.g., synonyms, antonyms, or related terms, between
terms of different thesauri. For example, Figure 13 shows a term
from a Chinese thesaurus that we set as the synonym of the term
bouclier. As a result, even though the Chinese term does not relate
to any data directly, by using the relations, we can find terms that
are linked to actual data. A full video demo of this example can be
found online²⁵.
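The synonym-based lookup can be sketched as a graph traversal over terms. The sketch is a toy model, not the Atlas glossary API: ThesaurusIndex and its methods are hypothetical, only synonym relations are followed, and the Chinese term used in the usage lines is a placeholder because the text does not spell out the term shown in Figure 13.

```python
from collections import defaultdict, deque


class ThesaurusIndex:
    """Toy index: terms point to data entities and to synonym terms."""

    def __init__(self):
        self._data_by_term = defaultdict(set)  # term -> indexed data entities
        self._synonyms = defaultdict(set)      # term -> synonym terms

    def index(self, entity, term):
        """Associate a data entity with a thesaurus term."""
        self._data_by_term[term].add(entity)

    def add_synonym(self, term_a, term_b):
        """Declare two terms (possibly from different thesauri) synonyms."""
        self._synonyms[term_a].add(term_b)
        self._synonyms[term_b].add(term_a)

    def search(self, term):
        """Return data indexed by the term or by any transitive synonym."""
        seen, queue, found = {term}, deque([term]), set()
        while queue:
            current = queue.popleft()
            found |= self._data_by_term.get(current, set())
            for other in self._synonyms.get(current, ()):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        return sorted(found)


idx = ThesaurusIndex()
idx.index("bibliographie", "bouclier")
idx.index("204docannexe.csv", "bouclier")
idx.add_synonym("盾", "bouclier")  # placeholder for the Chinese term
hits = idx.search("盾")
```

The Chinese term has no data attached to it, yet the search reaches both entities through the synonym link to bouclier.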


5 CONCLUSION AND FUTURE WORKS 


In this article, we first introduce the need of archaeologists for
software platforms that can host, process and share new, voluminous
and heterogeneous archaeological data. Since data lakes look like a
viable solution, we examine existing data lake solutions and conclude
that they are neither generic, flexible nor complete enough to
fulfill project HyperThesau’s requirements.


25 https://youtu.be/OmxsLhk24Xo
Figure 8: Metadata per se
[Atlas screenshot: properties of Hive table objects_origin]


Figure 9: Lineage


As a result, we propose a generic, flexible data lake architecture
that covers the full data lifecycle. ArchaeoDAL’s architecture is
generic because it does not depend on any specific technology. For
example, in our current implementation, we use HDFS as the storage
layer. Yet, one of our collaborators could easily replace HDFS by
Amazon S3. In Section 4.1.4, we demonstrate how to organize data
flexibly, which many existing solutions [1, 5, 6, 15] do not allow.
In Section 4.1.4, we also demonstrate that ArchaeoDAL can gather
metadata automatically during the full data lifecycle. Finally,
many features of ArchaeoDAL are very hard to demonstrate in a
paper. Thus, we recorded demo videos that are available online²⁶.


Figure 10: Sample Atlas classification


Figure 11: Artefacts’ thesaurus in Atlas

Archaeologists encounter two major problems while working 
with ArchaeoDAL. First, to associate data and terms in a thesaurus, 
domain experts are needed. Moreover, this data-terms matching 
is a very expensive and time-consuming operation. Thus, we plan 
to use natural language processing techniques to associate data 
with a thesaurus automatically, calling domain experts only for a 
posteriori verification. 


26 https://youtube.com/playlist?list=PLrj4IMV47Fy pKK5WyEd40j3-JnfSuU_H1 
Second, we handle a lot of images, e.g., aerial photographs and
satellite images. It is also very time-consuming to detect useful
objects in such images. Although some machine learning tasks
can already be performed from ArchaeoDAL via Spark-ML, we
would like to use deep learning techniques to assist archaeologists
in processing images more efficiently.
ABSTRACT


Searching for an available parking space is a stressful and
time-consuming task, which increases traffic and environmental
pollution through gas emissions. To address these issues, various
solutions relying on information technologies (e.g., wireless
networks, sensors) have been deployed over the last years
to help drivers identify available parking spaces. Several recent
works have also considered the use of historical data about parking
availability and applied learning techniques (e.g., machine learning,
deep learning) to estimate occupancy rates in the near future.
In this paper, we not only focus on training forecasting models for
different types of parking lots to provide the best accuracy, but also
consider the deployment of such a service in real conditions, to
solve actual parking occupancy problems. It is therefore necessary
to continuously provide accurate information to drivers and
also to handle the frequent updates of parking occupancy data. The
underlying challenges addressed in the present work thus concern
(1) the self-tuning of the forecasting model hyper-parameters
according to the characteristics of the considered parking lots and
(2) the need to maintain the performance of the forecasting model
over time.

To demonstrate the effectiveness of our approach, we present in 
the paper several evaluations using real data provided for different 
parking lots by the city of Lille in France. The results of these 
evaluations highlight the accuracy of the forecasts and the ability 
of our solution to maintain model performance over time. 
1 INTRODUCTION


Searching for an available parking space is stressful, time-consuming
and contributes to increasing traffic and pollution. According to [10],
this activity is a basic component of urban traffic congestion and
generates between 5% and 10% of the traffic in cities and up to 60%
in small streets. Moreover, it leads to unnecessary fuel consumption
and environmental pollution due to the emission of gases. Other
studies emphasize the costs of searching for a parking space. For
example, a study by Imperial College London mentions that,
during congested hours, more than 40% of the total fuel consumption
is spent while looking for a parking space (Imperial College
Urban Energy Systems Project). As a final example, a study by
Donald Shoup [23] indicates that, in Westwood Village (a commercial
district next to the campus of the University of California, Los
Angeles), searching for a parking spot accounts every year for about
47,000 gallons of gasoline, 730 tons of CO2 emissions and 95,000
hours (eleven years) of drivers’ time. Last but not least, circling for
a vacant parking space may lead to safety issues, as drivers may
not pay enough attention to surrounding cyclists and pedestrians
at that time.

In order to reduce the time needed to find an available parking
space, it may be helpful to provide drivers with accurate and
up-to-date information on parking lot occupancy. This information can
indeed be collected using various sensors and delivered to drivers
through on-board units, smartphones or by exploiting dynamic
staking. Moreover, to guide drivers towards locations where they
will actually find a place to park, it is crucial to forecast parking
occupancy in the near future (typically one hour or less) to inform
drivers while they are still on their way. Recent research works [5,
24] have therefore considered the use of parking history data and
compared different techniques (time series, machine learning, etc.).
These works show promising results and open new perspectives
for parking management.

Thus, in the future, drivers could request a parking space close to
their destination and/or matching their preferences (walking time to
destination, price, etc.) while they are still driving, and be guided
towards the best option according to the real-time supply and
demand. In this paper, we address the challenges that need to be
tackled to reach such an objective, namely (1) provide a forecasting
model which easily adapts to the parking lot considered in order
to provide the best accuracy and (2) maintain the performance of
the forecasting model over time in spite of the frequent changes
in parking occupancy data. Therefore, we propose a framework
exploiting machine learning techniques, and especially Recurrent
Neural Networks - Long Short-Term Memory (RNN-LSTM), to
forecast parking occupancy in the short term. We show that every
parking lot has a different profile in terms of parking occupancy
and so requires a dedicated tuning of the forecasting model in
order to obtain the best performance. The framework introduced
in the paper therefore integrates self-tuning features for different
hyper-parameters in order to obtain the best forecast whatever the
parking lot considered. Besides, to maintain a good forecast accuracy,
the forecasting models used by our framework have to be
continuously updated within a limited time to guarantee a continuous
and effective service to drivers.

The rest of the paper is structured as follows. In Section 2,
we present related works on building parking occupancy
prediction models. Section 3 describes the proposed framework. In
Section 4, we evaluate this framework and show results that demonstrate
the feasibility and the efficiency of our proposal. Finally, we present
our conclusions and outline directions for future work
in Section 5.


2 RELATED WORKS 


Recently, managing traffic, mobility, and transport systems in the
context of smart cities has attracted the attention of many
researchers. In the area of parking management in particular, many
solutions have been proposed, especially to meet the need for
sustainable parking services. These solutions cover different
parking management aspects and smart parking technologies [9, 15].
Nevertheless, these approaches remain less precise and less
efficient because they lack real-time forecasts that adapt to the
context of the parking lot.

Several recent works have focused on forecasting future parking
occupancy using various approaches such as Linear Regression (LR),
Auto-Regressive Integrated Moving Average (ARIMA) [1, 5, 20, 29],
Support Vector Regression (SVR), Support Vector Machine (SVM),
and Multiple Linear Regression (MLR) [29], Random Forest, Naive
Bayes, Markov models [2, 26], Multilayer Perceptron (MLP) [3, 22],
Long Short-Term Memory (LSTM) [11, 20, 27] and other
types of neural networks. Most of these works focus on comparing
models in order to find the method that provides the best
performance [3, 29].
Optimizing the performance of a model by integrating exogenous
factors has been considered by [1] in the parking context. In this
work, the authors compare Recurrent Neural Network - Long
Short-Term Memory (RNN-LSTM) and Recurrent Neural Network - Gated
Recurrent Units (RNN-GRU) and explain that exogenous factors
(e.g., weather, events, time of the day) affect forecasting results.
They also show that correctly configuring the hyper-parameters
used for training affects the model quality.

By using neural networks, it is possible to develop models
integrating spatio-temporal factors such as the Graphical Convolutional
Neural Network (GCNN) [27]. In this work, the authors investigate
the spatial and temporal relationships of traffic flows in the
network, weather, parking meter transactions, network topology,
speed and parking lot usage. The prediction horizon considered
is 30 minutes and the authors conclude, among other results, that
the forecast is better for business parking lots than for
recreational areas.

In [12], the authors use Convolutional Neural Networks (CNN) to
extract spatial relations and LSTM to capture temporal correlations.
The authors also use a Clustering Augmented Learning Method
(CALM) which iterates between clustering and learning to form a
robust learning process. An experiment conducted on the San
Francisco Parking dataset is presented. The results show that CALM
outperforms other baseline methods, including multi-layer LSTM
and Lasso, at predicting block-level parking occupancy.

Unsurprisingly, existing works focus on obtaining the best model
accuracy by applying different approaches. A major technique
to improve model performance consists in finely tuning the
model hyper-parameters [16, 19, 25]. Tuning the hyper-parameters
by hand is very expensive and requires a lot of expertise.
Every machine learning algorithm indeed has a distinct sensitivity
to hyper-parameter tuning [21]. Therefore, a few works
have addressed the automatic tuning of hyper-parameters [13]. The
work presented in [13] introduces a data-driven and supervised
learning approach to hyper-parameter optimization that replaces
the expert in tuning the hyper-parameters of the model. It considers
the optimization of a few major hyper-parameters (network selection,
number of layers and regularization function) by designing a new
algorithm to enhance model performance.

Meanwhile, [21] presents several ideas to optimize the selection
of hyper-parameters. This work is applied to the management of
spatial data in the context of ecological modeling. In [7], model
optimization is also considered, with an evolutionary algorithm
for parking occupancy forecasting in urban environments. The
authors present real-case implementation scenarios. Their approach
is accurate and useful in everyday life.

There are obviously many remaining challenges to tackle in order to
develop robust forecasting models and solve real parking problems.
Firstly, the deployment of a tuned/optimized forecasting model
is needed. Secondly, it is necessary to maintain the performance
of the model over time since updates are continuously produced.
Hence, we focus on proposing a solution to tackle these problems.
In this paper, we do not focus on exploring the correlations between
exogenous factors and parking occupancy, even if some of them are
integrated in our forecasting models. We rather extend the work
by [7], which focuses only on the number of layers and number of
neurons per layer, by also considering hyper-parameter tuning for
the learning rate, look-back window, batch size, and number of epochs.
This self-tuning of hyper-parameters is indeed needed to achieve an
efficient model deployment. Besides, an important limitation of existing
solutions resides in the lack of integration of real-time information
to refresh the forecasting model. This problem is tackled by applying
model monitoring and dynamic model selection in [8], but not in
the context of parking management.

In the following, we detail our solution to automatically tune
the hyper-parameters of our forecasting models for parking occupancy.
Our objective is to obtain the best performance. Moreover, our
forecasting models take real-time data as input, and these data are
also used to update the models and maintain their performance over time.


3 CONTINUOUS FORECASTING OF PARKING 
OCCUPANCY 


In this paper, we consider the problem of forecasting the occupancy
of parking lots in the near future (i.e., up to 1 hour in advance),
using historical parking data. As discussed in Section 2, several
works have already designed forecasting models and evaluated their
effectiveness on parking occupancy data. The authors exploit real
data and, after cleaning these data, train
models using one subset of the dataset called training data. Then,
they compare the accuracy of the forecast using different metrics on
another portion of the dataset called testing data. Several models
have thus been compared and some of them provide promising
results, especially Recurrent Neural Networks - Long Short-Term
Memory (RNN-LSTM) [1, 5, 22, 27, 28].

The originality of our work is to consider the deployment of
forecasting techniques for parking occupancy in real conditions.
One possible use case is the design of a service that continuously
provides drivers, during their travel, with the estimated future
occupancy of the parking lots close to their destination so that
they can choose one. In this section, we highlight
the challenges to be addressed to design such a service, not only
to obtain a good forecast in a given context, but also to provide
effective mechanisms to generalize the process to various types
of parking lots. Moreover, to offer a continuous forecasting
service to drivers, solutions are needed to maintain the quality of
the forecast over time. In the following, we first detail the challenges
that need to be addressed to achieve these objectives and then
present our solutions.


3.1 Challenges 


Developing a robust and reliable model to perform accurate
forecasts of parking lot occupancy is a challenging task, especially
when considering real parking circumstances. In the following, we
detail the two main challenges we have identified.

The first challenge concerns the design of an efficient
model whatever the type of parking lot considered. Indeed, every
parking lot has a different "profile" in terms of capacity, surrounding
amenities, occupancy trend or seasonality. For instance, Figure 1
illustrates the weekly evolution of parking occupancy for two
different parking lots in the city of Lille, one close to a commercial
center and the other one close to a train station. Thus, a model
tuned for a given parking lot may not provide good results if used
on a different one. Moreover, tuning and optimizing a forecasting
model involves many different parameters, called hyper-parameters
(number of layers, number of neurons per layer, activation
function, type of optimizer, loss function, number of epochs, learning
rate, look-back window, batch size, etc.). Numerous values or
choices are possible for each of these parameters, so selecting
the best combination is a non-trivial and time-consuming task.
Hence, manually tuning the parameters of
the forecasting model is not an option and a self-tuning process
is required to design an efficient model for each parking lot
where forecasts are needed.


Figure 1: Euralille and Gare Lille Flandres parking profiles
[Two panels: available parking spaces by hour and day of week, for
Parking Euralille and for Parking Gare Lille Flandres]


The second challenge when designing a forecasting
model concerns the lifetime of this model. The objective when
dealing with parking occupancy is obviously to perform a short-term
forecast, but the question here is how long the forecasting model
can effectively operate and provide an accurate forecast. Real-time
updates of the parking occupancy are indeed continuously
generated and may impact future forecasts. In this paper, our goal is
not only to prove that forecasting the future occupancy of a parking
lot can be achieved with high accuracy, but mainly to ensure that
the quality of the forecast remains high as time passes. Hence,
data about parking occupancy have to be continuously collected
and regularly injected into the forecasting models in order to keep
them up-to-date and efficient. This avoids propagating
the error made on one forecast to the following
ones, which would cause a very quick degradation of the forecast
quality.

In our context, we thus have to determine the lifetime of a
forecasting model (i.e., the period during which the model gives
relevant and significant results). Then, the model has to be updated
using data collected since the initial training or since its last
update. Note that the update phase should be done regularly and within
relatively short periods of time to avoid an interruption of the
forecast process or a strong degradation of its performance. This
update phase also keeps the forecast model fitted to the parking
profile. Indeed, the profiles illustrated in Figure 1 may obviously
change over time.
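One way to picture the notion of model lifetime is a monitor that compares each forecast with the observation that later arrives and flags the model for update once the recent error exceeds a tolerance. The sketch below is only an assumption about how such a check could look: the class ModelMonitor, the sliding-window RMSE criterion, the window size and the threshold are all hypothetical, and the forecast/observation pairs are made up.

```python
from collections import deque
from math import sqrt


class ModelMonitor:
    """Flags a forecasting model for update when recent error is too high.

    window and max_rmse are hypothetical tuning knobs.
    """

    def __init__(self, window=4, max_rmse=10.0):
        self.max_rmse = max_rmse
        self._squared_errors = deque(maxlen=window)

    def observe(self, forecast, actual):
        """Record one forecast once the real occupancy value is known."""
        self._squared_errors.append((forecast - actual) ** 2)

    def rmse(self):
        """RMSE over the most recent observations in the window."""
        if not self._squared_errors:
            return 0.0
        return sqrt(sum(self._squared_errors) / len(self._squared_errors))

    def needs_update(self):
        """True once the window is full and its RMSE exceeds the threshold."""
        full = len(self._squared_errors) == self._squared_errors.maxlen
        return full and self.rmse() > self.max_rmse


monitor = ModelMonitor(window=3, max_rmse=5.0)
# Made-up (forecast, actual) pairs in which the forecast slowly drifts.
pairs = [(100, 101), (102, 100), (105, 99), (110, 98), (115, 97)]
flags = []
for forecast, actual in pairs:
    monitor.observe(forecast, actual)
    flags.append(monitor.needs_update())
```

The first index at which the flag turns true marks the end of the model's useful lifetime under this particular criterion; the data collected up to that point would then feed the update.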


3.2 Contributions 


In this section, we present our solution to address the challenges
introduced in Section 3.1, namely the self-tuning of the forecasting
model and its retraining over time to keep a good quality forecast.
3.2.1 Self-tuning of the forecasting models. As mentioned in
Section 2, several works have compared regression methods
(SARIMAX, MLR, SVR) and RNN-LSTM to determine the best one(s)
to forecast parking occupancy. These studies concluded that
RNN-LSTM provides the best prediction results [1, 5, 22, 27, 28].
Moreover, the experiments we conducted (see Section 4) led to the
same conclusion. Hence, we will focus on RNN-LSTM to forecast
parking occupancy in the rest of the paper.


Figure 2: Process Flow
[Flowchart boxes: data collection, preprocessing and data split, model
training, optimization (auto tuning), save model, load model, model
deployment, model monitoring, update model]


Figure 2 describes the process we propose to manage the
heterogeneity of parking profiles and deploy the best forecasting model
whatever the characteristics of the target parking lot. It shows the
five main steps of our approach: data collection, pre-processing,
model training, evaluation and deployment. In this process, a large
amount of data about parking occupancy first has to be collected,
cleaned and normalized. As usual, this dataset is split into several
parts used for training the model in the learning phase and for testing
it, that is, comparing the forecasts obtained with the trained model
with the actual data contained in the testing set.
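The pre-processing steps can be sketched in plain Python. The helper names, the min-max scaling choice, the 80/20 ratio and the occupancy values are assumptions for illustration; the text only states that data are normalized, split into training and testing parts, and later fed to the model through a look-back window.

```python
def min_max_normalize(series):
    """Scale values to [0, 1]; returns the scaled list plus (min, max)."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat series
    return [(v - lo) / span for v in series], (lo, hi)


def chronological_split(series, train_ratio=0.8):
    """Split without shuffling so the test part lies strictly in the future."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]


def make_windows(series, look_back):
    """Turn a series into (inputs, target) pairs using a look-back window."""
    return [(series[i:i + look_back], series[i + look_back])
            for i in range(len(series) - look_back)]


# Made-up occupancy counts, one value per time step.
occupancy = [120, 135, 150, 180, 210, 240, 230, 200, 160, 130]
scaled, (lo, hi) = min_max_normalize(occupancy)
train, test = chronological_split(scaled, train_ratio=0.8)
samples = make_windows(train, look_back=3)
```

Each sample pairs the last three scaled observations with the value to predict next. In practice the scaling bounds are often computed on the training part only, so that no information from the testing set leaks into training.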

When using neural networks, there are two main customization
techniques to optimize the model performance: choosing
the architecture of the network (or network selection) and
tuning the hyper-parameters used during the training phase. This
customization phase allows obtaining good performance according
to the forecasting objectives. To determine the network structure,
the following variables have to be considered:


(1) Number of layers: represents the number of hidden layers
between the input and output layers in the neural network.

(2) Number of neurons: in the learning process, each neuron at
a given layer calculates the weighted sum of its inputs, adds the
bias and executes an activation function.

(3) Dropout function: provides a regularization method used
to avoid over-fitting. Over-fitting occurs when a model
fits the training set too closely and does not perform well on
unseen data.

IDEAS 2021: the 25th anniversary 


Miratul Khusna Mufida, Abdessamad Ait El Cadi, Thierry Delot, and Martin Trépanier 


(4) Activation function: this function, such as relu, sigmoid
or tanh, is used to add non-linearity to the model.
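The computation performed by a single neuron in item (2), followed by the activation functions of item (4), can be sketched as below. This shows a plain feed-forward neuron for illustration, not the internals of an LSTM cell; the function names and the numeric inputs are assumptions.

```python
import math

# The three activation functions named in the list above.
ACTIVATIONS = {
    "relu": lambda z: max(0.0, z),
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "tanh": math.tanh,
}


def neuron_output(inputs, weights, bias, activation="relu"):
    """Weighted sum of the inputs, plus the bias, through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return ACTIVATIONS[activation](z)


def dense_layer(inputs, weight_rows, biases, activation="relu"):
    """A layer is one neuron per row of weights, all fed the same inputs."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# 0.5*1.0 + (-1.0)*2.0 + 0.5 = -1.0, which relu clips to zero.
out_relu = neuron_output([1.0, 2.0], [0.5, -1.0], bias=0.5, activation="relu")
# 0.5*1.0 + (-1.0)*2.0 + 1.5 = 0.0, and sigmoid(0) is one half.
out_sigmoid = neuron_output([1.0, 2.0], [0.5, -1.0], bias=1.5, activation="sigmoid")
layer = dense_layer([1.0, 2.0], [[1.0, 1.0], [0.5, -1.0]], [0.0, 0.5])
```

The number of neurons per layer is the number of weight rows, and the number of layers is how many such calls are chained.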


Moreover, for hyper-parameter tuning, several items also have
to be considered, namely:


(1) Type of optimizer: this item plays a crucial role in increasing
the accuracy of the model. Different types of optimizer can be
selected, such as Stochastic Gradient Descent (SGD), Nesterov
Accelerated Gradient, Adagrad, AdaDelta, Adam or RMSProp.

(2) Number of epochs: this number determines how many times
the whole training set is processed by the neural network during
the training phase.

(3) Batch size: it represents the size of the portions of the dataset
used to train the network at each iteration.

(4) Loss function: this function provides a way to calculate the
loss, that is, the prediction error of the neural network. Again,
several choices are possible such as Mean Absolute Error
(MAE), Mean Squared Error (MSE), etc.

(5) Learning rate: it determines how much the model should
change according to the estimated error each time the model
weights are updated.

(6) Regularization function and factor: these elements impose
constraints on the weights within the network to avoid
over-fitting and thus improve the model performance.
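How epochs, batch size, loss function and learning rate interact can be sketched on a deliberately tiny model, y = w * x, trained by mini-batch gradient descent. The model, the data and the chosen values are assumptions for illustration and are unrelated to the RNN-LSTM used in the paper.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def sgd_epoch(w, xs, ys, learning_rate, batch_size):
    """One epoch of mini-batch gradient descent for the model y = w * x."""
    for start in range(0, len(xs), batch_size):
        bx = xs[start:start + batch_size]
        by = ys[start:start + batch_size]
        # Gradient of the batch MSE with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
        # The learning rate scales how far the weight moves per update.
        w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2
w = 0.0
losses = []
for epoch in range(20):  # number of epochs
    w = sgd_epoch(w, xs, ys, learning_rate=0.05, batch_size=2)
    losses.append(mse(ys, [w * x for x in xs]))
```

With these values the weight converges towards 2. A much larger learning rate would make the updates overshoot, which is one reason these hyper-parameters have to be tuned together.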


Default values can be determined for these different parameters
but obviously, in order to obtain the best model, a tuning phase
is necessary. The difference in forecast quality between combinations
is significant and will be illustrated in Section 4. This
selection process can of course be done manually by changing the
number of layers, the number of neurons per layer, the activation
function, the learning rate, etc. However, due to the very high
number of parameters and possible combinations, this process may be
very costly in terms of time and resources. The chances of quickly
determining the best combination are low, and the person in charge
of this tuning phase needs strong expertise for it to be
effective.

Hence, the big question here is how to make the tuning of our
forecast model autonomous in order to best fit the parking
characteristics [13]. Basically, there are two possible strategies to
achieve an optimal tuning. The first one is to design
an algorithm that automatically tunes the hyper-parameters and
determines an optimal combination according to the objective. The
second one is to develop an algorithm that reduces the set of
hyper-parameters to tune by fixing the optimal value for the other
ones [13]. In the following, we investigate the first approach and
present a self-tuning mechanism for the hyper-parameters of our
forecasting model. This solution is more generic and can be applied
to different parking lots in different contexts.

To simplify the problem, we focus on automatically tuning the
possible combinations of hyper-parameters. Thus, automatic
hyper-parameter tuning applies to several variables and produces its
output faster than manual hyper-parameter tuning, as we will show in
Section 4.

Algorithm 1 uses Random Search [6] to determine the optimal
value for different parameters of an RNN-LSTM model to forecast
parking occupancy in the short term. We selected Random
Search to tune the parameters because it outperforms grid search [6].
More precisely, this algorithm tries to set the best values for a subset
of hyper-parameters, namely the number of layers, the number
of neurons per layer, the learning rate, the look-back window, the
batch size, the type of optimizer and the number of epochs.


Algorithm 1: Random Search for auto-tuning the hyper-parameters


Require: NL Number of layers
Require: NN Number of neurons per layer
Require: LR Learning rate
Require: OP Optimizer
Require: D Dropout
Require: RD Recurrent Dropout
Require: BS Batch Size
Require: NE Number of epochs
Require: LW Look back window
Require: AF Activation Function
1: NL ← [3..10]
2: LW ← [2, 4, 6, 12, 24, 48, 96, 192]
3: NN ← [2, 4, 8, 16, 32, 64, 128]
4: AF ← ['linear', 'ReLu', 'sigmoid', 'tanh', 'SeLu', 'hyperbolic']
5: D, RD ← [0.1..1.0]
6: BS ← [2, 4, 8, 16, 32, 64, 128]
7: NE ← [25, 50, 70, 100]
8: LR ← [0.001, 0.003, 0.005, 0.03, 0.05, 0.01, 0.1, 0.3, 0.5]
9: OP ← ['ADAM', 'ADAGRAD', 'RMSPROP', 'SGD']

Ensure: Trained model with particular NL, LW, NN, AF, D, RD,
BS, NE, LR and OP
Ensure: Accuracy Metrics
Ensure: Multistep Training and Validation Loss
Create RNN-LSTM Model

10: STEP 1: Set randomly the hyper-parameters (NL, LW, NN, AF,
    D, RD, BS, NE, LR, OP)
11: STEP 2: M* ← Create the model with the chosen hyper-parameters
12: STEP 3: A* ← Evaluate the accuracy of the model M*
13: STEP 4: Pick other values of hyper-parameters, using a
    given distribution
14: STEP 5: M ← Update the model with the new hyper-parameters
15: STEP 6:
16: if (A, the accuracy of M, is better than A*) then
17:     A* ← A and
18:     M* ← M
19: end if
20: STEP 7:
21: if (we reach MAX iteration) then
22:     exit
23:     return M*
24:     return A*
25: else
26:     Goto STEP 4
27: end if
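The loop of Algorithm 1 can be sketched in Python. The candidate values are copied from the algorithm, with three assumptions: the ranges written [a..b] are discretized (integers for NL, steps of 0.1 for D and RD), the score is an error to minimize rather than an accuracy to maximize, and toy_evaluate is a hypothetical stand-in for training an RNN-LSTM and measuring its validation error.

```python
import math
import random

# Candidate values from Algorithm 1; the discretization of the ranges
# for NL, D and RD is an assumption.
SEARCH_SPACE = {
    "NL": list(range(3, 11)),
    "LW": [2, 4, 6, 12, 24, 48, 96, 192],
    "NN": [2, 4, 8, 16, 32, 64, 128],
    "AF": ["linear", "ReLu", "sigmoid", "tanh", "SeLu", "hyperbolic"],
    "D": [round(0.1 * i, 1) for i in range(1, 11)],
    "RD": [round(0.1 * i, 1) for i in range(1, 11)],
    "BS": [2, 4, 8, 16, 32, 64, 128],
    "NE": [25, 50, 70, 100],
    "LR": [0.001, 0.003, 0.005, 0.03, 0.05, 0.01, 0.1, 0.3, 0.5],
    "OP": ["ADAM", "ADAGRAD", "RMSPROP", "SGD"],
}


def sample(space, rng):
    """STEP 1 / STEP 4: draw one value per hyper-parameter, uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(evaluate, space, max_iter, seed=0):
    """Return (best_params, best_score); a lower score is better."""
    rng = random.Random(seed)
    best_params = sample(space, rng)
    best_score = evaluate(best_params)         # STEP 2 and STEP 3
    for _ in range(max_iter - 1):              # STEP 7: stop at MAX iteration
        params = sample(space, rng)            # STEP 4
        score = evaluate(params)               # STEP 5
        if score < best_score:                 # STEP 6: keep the better model
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Hypothetical stand-in for 'train the model, return its error'."""
    return (abs(params["NL"] - 4)
            + abs(math.log2(params["NN"]) - 5)
            + abs(params["LR"] - 0.01))


grid_size = math.prod(len(values) for values in SEARCH_SPACE.values())
best_params, best_score = random_search(toy_evaluate, SEARCH_SPACE, max_iter=50)
```

Under this discretization an exhaustive grid would contain grid_size = 270,950,400 combinations, which illustrates why sampling a fixed budget of configurations is attractive. In a real setting, toy_evaluate would be replaced by a function that builds, trains and validates the network for the sampled configuration.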
A few research works have already focused on the automation of
hyper-parameter tuning in the context of parking management [1]
or in different ones, such as supply chain problems [14]. The
limitations of these works mainly reside in the limited number of layers
and number of neurons per layer they considered [1]. In this paper,
we consider the self-tuning of a larger subset of hyper-parameters
to improve the model performance.

Once a model is trained, we evaluate its performance on the testing
dataset. To this end, we compute several metrics: RMSE, MAPE,
Adjusted R² and tracking signal. We select the model with the best
accuracy, i.e., the maximum value for Adjusted R² and the minimum
values for RMSE, MAPE and tracking signal. Adjusted R² is used to
determine the explanatory power of our model: it gives the percentage
of the variation of the dependent variable that is explained by the
independent variables. Tracking signal is used to monitor the quality
of our prediction model.
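
As an illustration, these four metrics can be computed as follows. This is a sketch using common textbook definitions, which are assumptions here since the text does not give formulas; in particular, the tracking signal is taken as the cumulative forecast error divided by the mean absolute deviation, and `n_predictors` is a hypothetical argument for the number of independent variables.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def adjusted_r2(actual, forecast, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(actual)
    mean = sum(actual) / n
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def tracking_signal(actual, forecast):
    """Cumulative forecast error divided by the mean absolute deviation."""
    errors = [a - f for a, f in zip(actual, forecast)]
    mad = sum(abs(e) for e in errors) / len(errors)
    return sum(errors) / mad
```

With these definitions, a tracking signal close to zero indicates that over-forecasts and under-forecasts cancel out, whereas a large positive or negative value reveals a systematic bias.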


3.2.2 Continuous model update. To provide an effective service to
drivers and inform them of the future occupancy of nearby parking
lots, the challenge is not only to develop and tune the best forecast
model for each parking lot, but also to maintain its performance
over time. The difficulty here is to monitor, maintain or improve
the model performance over time [8]. The process we propose to
keep a forecasting model efficient over time is
depicted in Figure 3.


[Figure 3 diagram. Labels recovered: continuous learning and forecasting by updating model; tune hyper-parameter; train the whole model; retrain the model; make prediction; evaluation; prediction result; accuracy within specification (yes); real-time monitoring.]

Figure 3: Update process of forecasting models


The first challenge is to determine the moment when an action
is needed to update the model. Prediction models use former
values or observations to estimate the future ones. When the
model starts forecasting, the first forecasts, obtained from
real observations, are in turn used to determine the next ones. Hence,
the errors introduced by the learning algorithm accumulate
and the quality of the forecasts inevitably decreases over time.
Once the moment at which the model has to be updated is determined,
the model has to be effectively updated. This update
has to be performed in a short time, since it has to take effect
before the quality of the forecasts provided by the model decreases
too much. Hence, it is not possible to fully retrain the model
continuously, since this would take too much time. In this work, we
instead investigate the use of the continuous updates provided by the
parking lot to partially retrain the model. We thus assume that the
parking lots deliver updates of their occupancy every
few minutes. Parking occupancy changes continuously,
but gathering information every 2 to 5 minutes allows training an
efficient model and maintaining its performance over time, as we
will show in Section 4.

The advantage of using neural networks for time series forecasting
resides in the possibility to update the model weights when
new data is available. Our process to update our forecasting model
on the fly is depicted in Figure 3. The yellow block represents the
model monitoring mechanism whereas the grey one depicts the
model update integrating new data. We start by detecting when
the model performance starts degrading over time, in order to
determine exactly when the model should be updated.

To deploy a forecasting model on a given parking lot, we first set up
an LSTM model and train it using historical data. We thus obtain a
baseline model. This baseline model can provide forecasts during a
certain period of time before its performance starts degrading. The
baseline model can thus be deployed and evaluated to compare the
results with other approaches. Once the baseline model is deployed,
we update it using real-time data continuously collected by our
system.

The update mechanism is activated when the model performance
degrades. It maintains the model performance over time
by injecting real-time data and retraining the model for some epochs.
The quality of the model updates depends on the number of additional
epochs performed on new data to update the weights of the model.
The size of the dataset used to retrain the model also impacts
the time needed to perform the update. Experimental results are
presented in Section 4 to show the effectiveness of the update
and the impact of the number of epochs performed.
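
The idea of partial retraining can be illustrated with a deliberately tiny example. The sketch below is not the LSTM used in this paper: it assumes a one-weight linear model trained by per-sample gradient descent as a stand-in, and all names and data are hypothetical. It only shows that continuing training from the current weights for a few epochs on recent data adapts the model without training from scratch.

```python
def train_epochs(weight, bias, xs, ys, epochs, lr=0.1):
    """Run `epochs` passes of per-sample gradient descent on the squared
    error, starting from the given (weight, bias) rather than from scratch."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = (weight * x + bias) - y
            weight -= lr * error * x
            bias -= lr * error
    return weight, bias


def mse(weight, bias, xs, ys):
    """Mean squared error of the linear model on (xs, ys)."""
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
history = [2.0 * x for x in xs]   # data seen when the baseline was trained
recent = [3.0 * x for x in xs]    # newer observations after the pattern drifted

baseline = train_epochs(0.0, 0.0, xs, history, epochs=200)   # full training
updated = train_epochs(*baseline, xs, recent, epochs=30)     # partial retraining
```

Comparing `mse(*baseline, xs, recent)` with `mse(*updated, xs, recent)` shows the error on recent data before and after the short update. The number of additional epochs plays the same role as in the text: more epochs improve the fit to new data but lengthen the update.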

The key points of this process reside in the availability of
real-time data and in the monitoring procedure used to know when the
model needs to be updated. Our mechanism thus relies on real-time
monitoring and on an update mechanism to achieve continuous learning
and forecasting. The real-time monitoring mechanism determines when
the model should be updated by detecting that its performance starts
degrading. Then, retraining the model regularly with new unseen
data allows maintaining its performance over time and supporting
possible changes in the occupancy of the parking lot.
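
A monitoring rule of this kind can be sketched as follows. The 5% MAPE threshold is the one used later in Section 4.2.4; the sliding window and the function names are assumptions made for illustration only.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def first_update_step(actual, forecast, window, threshold_pct=5.0):
    """Return the index of the first step at which the MAPE over the last
    `window` observations exceeds the threshold, or None if it never does."""
    for end in range(window, len(actual) + 1):
        recent_actual = actual[end - window:end]
        recent_forecast = forecast[end - window:end]
        if mape(recent_actual, recent_forecast) > threshold_pct:
            return end - 1
    return None
```

When `first_update_step` returns an index, the update mechanism would be triggered at that step; a return value of `None` means the model stayed within specification over the observed period.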

In this section, we have presented the main challenges to address
as well as our solutions to provide drivers with parking occupancy
forecasts in a continuous manner. In the next section, we describe the
evaluations carried out to show the performance of our self-tuning
mechanism, to determine the basic lifetime of the forecasting
models produced, and to maintain their performance over time.
4 EXPERIMENTAL STUDY AND RESULT


In this section, we describe the experiments carried out to show the
efficiency of our continuous learning mechanism. These experiments
have been conducted using the machine learning
frameworks TensorFlow and Keras.


4.1 Dataset description and basic settings 


Real parking data provided by Lille European Metropolis (MEL), the
largest city in the North of France with 1.2 million inhabitants, have
been used to build and evaluate our models. The dataset contains
data about the occupancy of 27 parking lots, representing a total
of 18,180 parking spaces, from December 2018 to February 2020.
The number of cars parked is refreshed every few minutes and the
updates can be easily accessed online. Each parking area has a
different capacity and occupancy "profile", as illustrated in Figure 4
for parking Euralille and in Figure 5 for parking Gare Lille Europe.
Parking Euralille is located near a shopping mall whereas parking
Gare Lille Europe is next to a train station.
Figure 4: Occupancy of parking Euralille over time

Figure 5: Occupancy of parking Gare Lille Europe over time


The raw data are stored in daily CSV files. Each file contains the
daily data for all parking areas and consists of 12 columns, namely:
label (parking name), address, city, status (open or closed), number
of available spaces, maximum capacity, date, parking id, coordinates,
geometry, display panels and timestamp.

¹ For more information, see https://opendata.lillemetropole.fr/explore/dataset/disponibilite-parkings/information/.

In the following, we evaluate our contributions to continuously
forecast parking occupancy using this real dataset. However, the
raw data that we use to deploy the model are noisy and some values
are missing. In order to make the dataset compliant with the training
step, we first perform data cleaning. To handle missing
values in our time series, we replace them by the value
observed at the same time during the preceding week. The different
phases of the model development can then start: training, validation,
optimization with (auto) hyper-parameter tuning, testing and
evaluation. For the experiments considered in the rest of this
section, we used a portion of the dataset ranging from December 19th,
2018 to March 7th, 2019. 80% of the data were used for training,
10% for validation and 10% for testing. To evaluate the quality of
the models generated and select the best one, several metrics are
considered: accuracy, Root Mean Squared Error (RMSE), Mean Absolute
Percentage Error (MAPE), Adjusted R² and tracking signal.
MAPE presents the forecast error as a percentage. RMSE
presents the gap between the real data and the forecast; it is
the loss function commonly used in the related works. Adjusted
R² gives the percentage of the variation of the dependent
variable that is explained by the independent variables; the closer
to 1, the better. Finally, tracking signal is used to monitor and
identify the prediction deviation.
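
The data cleaning and splitting steps described above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the 80/10/10 split is assumed to be chronological (no shuffling), which the text does not state explicitly.

```python
def fill_from_previous_week(series, steps_per_week):
    """Replace each missing value (None) by the value at the same time one
    week earlier. Values are filled left to right, so a gap can be filled
    from an earlier filled value; gaps in the first week stay missing."""
    filled = list(series)
    for i, value in enumerate(filled):
        if value is None and i >= steps_per_week:
            filled[i] = filled[i - steps_per_week]
    return filled


def chronological_split(series, train=0.8, validation=0.1):
    """Split a time series into train/validation/test parts without shuffling."""
    n = len(series)
    n_train = int(n * train)
    n_val = int(n * validation)
    return series[:n_train], series[n_train:n_train + n_val], series[n_train + n_val:]
```

For data refreshed every 5 minutes, `steps_per_week` would be 2016 (12 per hour, 24 hours, 7 days).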


4.2 Results 


In this section, we propose an evaluation of our different contri- 
butions. We start by illustrating how we deploy a baseline model 
and by comparing the performance provided by RNN-LSTM to the 
other forecasting techniques. 


4.2.1 Models Performance Comparison. Our objective here is to
justify why we focused on RNN-LSTM in Section 3. We therefore
compared the performance of different forecasting models, namely
SARIMAX, MLR, SVR and RNN-LSTM, for short-term
predictions on a time series dataset. To know which model performs
best, we compared the forecasts provided by these four models
using the Euralille dataset. After training the models,
we evaluate their performance and provide the MAE, MSE, MAPE
and RMSE errors in Table 1, as well as the tracking signal in Figure 6.
In Table 1 and Figure 6, the smallest error indicates the best model.
Also, the gap between the values obtained on the training set and the
ones obtained on the testing set indicates the quality of
the fit of the model: whether it over-fits, under-fits, or has
the best fit. A best-fit model performs well when forecasting unseen
data, whereas over-fitting or under-fitting models struggle to
perform well on different datasets.


Table 1: Errors comparison of four different models

Loss Function | RNN-LSTM | SARIMAX | MLR    | SVR
MAE Training  | 0.13     | 0.28    | 0.12   |
MAE Testing   | 0.13     | 0.11    | 0.18   |
MSE Training  | 0.03     | 0.12    | 0.31   |
MSE Testing   | 0.13     | 0.46    | 0.47   |
MAPE Training | 17.05%   | 39.71%  | 17.02% | 38.29%
MAPE Testing  | 18.43%   | 15.60%  | 25.53% | 15.60%
RMSE Training | 0.17     | 0.34    | 0.55   | 0.36
RMSE Testing  | [only one value, 0.69, is legible; its column is unclear]

Figure 6: Tracking signal comparison for 4 different predictors

The error measured on the training set is known as the model
bias; the common term for the error on the testing set is variance.
We aim at obtaining the minimum value for both bias and variance.
Tracking signal is used to represent the forecasting bias of the
model. MAE, MSE, RMSE, and MAPE are used to exhibit the forecasting
model’s variation. Based on the results presented in Table 1 and
Figure 6, we observe that RNN-LSTM outperforms SARIMAX, MLR and SVR
with the minimum error and the best fit. Hence, we focus in the
rest of this section on RNN-LSTM.

The first result, obtained using a single parking dataset (Euralille),
shows that RNN-LSTM outperforms the other models. In Section 4.2.2,
we study the possibility of using a model trained with a particular
dataset to perform forecasts for another (previously unseen)
dataset.


4.2.2 Generalization of model deployment to different parking areas.
Our objective is to forecast the occupancy of a large number of
parking lots (e.g., all the parking lots in a city). An option here
could be to build a general model (super model) using all the data
available and then use it to forecast the occupancy of all considered
parking lots. This is a challenging task [1, 7] and it does not work
properly due to the specificity of each parking lot in terms of
occupancy trend. We therefore decided to optimize one model per
target parking lot. To illustrate our point, we present the results
obtained by training and optimizing a model using a single parking
area (i.e., Euralille). Table 2 shows the list of hyper-parameters
selected for the model after the tuning phase. Then, we applied the
same optimized model to another parking area (i.e., Gare Lille
Europe). The forecasting results on the original parking lot (used
for the tuning) have scores of 0.66% for MAPE and 0.126 for RMSE,
while on the new one we obtain scores of 17.6% and 0.3753,
respectively. The quality of the forecast significantly decreases,
as shown in Table 3. In conclusion, to obtain the best performance,
we have to tune the hyper-parameters for each parking lot, taking
its profile into account. A single forecasting model cannot provide
good forecasts for every parking lot. In Section 4.2.3, we focus on
hyper-parameter tuning.


4.2.3 Auto-tuning of hyper-parameters. To actually deploy the best
model for each parking lot, it is crucial to automatically determine
the best hyper-parameters. Tuning hyper-parameters is an effective
technique to increase model performance, as shown in Table 4.
This approach is time-consuming due to the huge search
space (i.e., number of parameters and number of combinations). Its
duration also depends on the selected number of epochs and on
the performance of the computer used, but training a single model
usually takes several tens of minutes (57 minutes for one training
in our experiments).

Table 2: Values of hyper-parameters selected for Euralille

Hyper-parameter     | Tuning
Batch size          | 32
Epoch               | 50
Prediction horizon  | 12
Look back window    | 2016
Step                | 12
Activation Function | RMSPROP
Number of layers    | 3
Neurons per layer   | 128/64/12
Loss function       | MAE, Accuracy

Table 3: Comparison of errors when using model trained on one parking lot (Euralille) on different parking lots

Parking           | RMSE   | MAPE
Euralille         | 0.1260 | 0.66%
Gare Lille Europe | 0.3753 | 17.6%

[Figure 7 plot. Legend: MAPE Euralille, MAPE GLE; time axis from 09:40:00 to 11:20:00.]

Figure 7: MAPE evolution over time

Table 4: Impact of hyper-parameters tuning on model performance

Model    | RMSE | MAPE
Baseline | 0.37 | 41.29%
Tuning 1 | 0.28 | 21.86%
Tuning 2 | 0.29 | 30.57%
Tuning 3 | 0.28 | 30.6%
Tuning 4 | 0.22 | 28.21%
Tuning 5 | 0.13 | 21.64%

To obtain a best-fit model that is able to forecast unseen data
in the future, auto-tuning the hyper-parameters provides the best
solution. From our experience, using the default parameters for
tuning a model leads to problems of over-fitting/under-fitting, as
shown in Figure 8. Avoiding over-fitting and under-fitting is an
important objective. Over-fitting refers to a model that fits
the training data well but does not generalize to
another dataset. Under-fitting refers to a model that can fit
neither the training set nor another
dataset. Another possibility is to choose the combination of
hyper-parameters by hand. This is also a challenging task considering
the size of the search space. Also, since we have to deal with various
profiles of parking lots, this would require an expert to tune each
model before it can be exploited. Hence, auto-tuning constitutes
the best optimization technique and the easiest way to find the
best-fit model setting. It finds settings for which the training and
validation losses converge to the same point, as illustrated in
Figure 9, contrary to the default settings shown in Figure 8.
Table 5 and Table 6 show
that the auto-tuning of hyper-parameters is efficient and effective
to obtain the best model performance.

Table 5 shows the hyper-parameters used for the default tuning
(for both parking lots) and the ones selected by Algorithm 1,
introduced in Section 3.2, for parking Euralille and GLE. Table 6 shows
the improvement of the performance resulting from the auto-tuning
phase, with a strong reduction of the error for both parking lots.

The auto-tuning of hyper-parameters facilitates model deployment
by automatically selecting the best hyper-parameters for the
training phase. Another challenge is that the
predictive performance of each model declines over time.
Section 4.2.4 thus focuses on the updates of the forecasting
models needed to avoid a degradation of the forecast quality over time.


[Plot: multi-step training and validation loss]

Figure 8: Manual hyper-parameters tuning and under-fitting/over-fitting problems

Figure 9: Impact of hyper-parameters auto-tuning on under-fitting/over-fitting problems
Table 5: List of default and optimal hyper-parameters after tuning for Euralille and Gare Lille Europe

Hyper-parameters    | Default Euralille | Default GLE | Euralille    | GLE
Batch size          | 1                 | 1           | 64           | 32
Epoch               | 10                | 10          | 25/50/70/100 | 35/40/50
Prediction horizon  | 72                | 72          | 2/12/48/72   | 3/12
Look back window    | 2016              | 2016        | 1800/2016    | 2008
Step                | 12                | 12          | 12           | 12
Activation Function | ADAM              | ADAM        | RMSPROP      | ADAM
Number of layers    | 3                 | 3           | 3            | 5
Neurons per layer   | 16/32/72          | 16/32/72    | 128/64/12    | 4/128/64/16/128
Loss function       | MAE               | MAE         | MAE          | MAE
RMSE                | 0.3647            | 0.8751      | 0.1260       | 0.5569
MAPE                | 18.43%            | 35.85%      | 0.66%        | 0.03%


Table 6: Forecasting performance improvement (RMSE and MAPE) after auto-tuning of the hyper-parameters

Model                    | RMSE   | MAPE
Default Tuning Euralille | 0.3647 | 18.43%
Default Tuning GLE       | 0.8751 | 35.85%
Self Tuning Euralille    | 0.1260 | 0.66%
Self Tuning GLE          | 0.5569 | 0.03%


4.2.4 Maintaining model performance over time. In this section, we
evaluate how long the forecast provided by a model, once tuned,
remains accurate enough to be used in a real situation. We thus focus
on a single day and observe the evolution of the forecast quality over
time. Figure 10 illustrates the evolution of the MAPE for forecasts
performed every 10 minutes on two different parking lots (Euralille
and GLE) during 120 minutes. The MAPE captures the evolution of
the error (i.e., the difference between the forecast and the observed
value) over time. We observe in Figure 7 that the model performance
degrades over time: the error exceeds the threshold of 5% after
30 minutes. This information is very useful to determine when the
model needs to be retrained to keep a good forecast quality.

To prevent the model performance from degrading too much and
the MAPE score from exceeding a particular threshold (e.g., 5%), we
partly retrain the model by running several epochs and injecting the
latest updates of the parking occupancy. Our previous experiments
on both parking areas also indicate that the model performs well
only during a relatively short period of time (between 10 and 30
minutes), as shown in Figure 10. Hence, the retraining phase should
be very quick, and a full retraining of the model is not possible due
to a lack of time. Therefore, we investigated the best, or lowest,
number of epochs needed to improve the performance of an already tuned
model. Figure 11 shows the evolution of RMSE for both parking
lots according to the number of epochs considered when retraining
the model. The optimal retraining is obtained for both parking lots
between 30 and 35 epochs. The time needed for this retraining
phase ranges between 5 and 6 minutes.

Figure 10: MAPE evolution over time for parking Euralille and Gare Lille Europe

Figure 11: RMSE evolution according to the number of epochs for parking Euralille and Gare Lille Europe

Finally, Figure 12 illustrates the effectiveness of our update
mechanism. In this figure, we see the evolution of the MAPE
error over the time elapsed since the training of the model, while
applying our update process every 30 minutes. Contrary to the scenario
considered in Figure 10, we observe that the MAPE error is then
maintained under the threshold of 5%. This result shows that our
model update mechanism improves the model performance by
continuously injecting real data.
Figure 12: MAPE evolution over time (with updates) for parking Euralille and Gare Lille Europe


5 CONCLUSION AND FUTURE WORK 


In this paper, we have presented solutions to continuously forecast
parking availability using both historical data and real-time
updates. To this end, we have proposed a solution that can
automatically adjust the forecasting model to each parking lot, as
well as a mechanism to keep the model efficient over time.

Besides, parking is competitive by nature because, after making
a choice to visit a particular slot, the success in obtaining that slot
depends on the choices of other, closer vehicles [4]. Obviously,
forecasting parking availability in the near future does not solve
this issue. However, features providing real-time forecasts of
available parking spaces may significantly improve frameworks designed
to allocate vehicles to available parking spaces [17, 18], and thus
manage competition, by providing them with information about the
future parking occupancy to make better decisions.
Abstract


Patient-Reported Outcome (PRO) surveys are used to monitor 
patients’ symptoms during and after cancer treatment. Late symp- 
toms refer to those experienced after treatment. While most patients 
experience severe symptoms during treatment, these usually subside 
in the late stage. However, for some patients, late toxicities persist 
negatively affecting the patient’s quality of life (QoL). In the case of 
head and neck cancer patients, PRO surveys are recorded every week 
during the patient’s visit to the clinic and at different follow-up times 
after the treatment has concluded. In this paper, we model the PRO 
data as a time-series and apply Long-Short Term Memory (LSTM) 
neural networks for predicting symptom severity in the late stage. 
The PRO data used in this project corresponds to MD Anderson 
Symptom Inventory (MDASI) questionnaires collected from head 
and neck cancer patients treated at the MD Anderson Cancer Center. 
We show that the LSTM model is effective in predicting symptom 
ratings under the RMSE and NRMSE metrics. Our experiments 
show that the LSTM model also outperforms other machine learning 
models and time-series prediction models for these data. 


CCS Concepts 


• Computing methodologies → Neural networks.
1 Introduction

During head and neck cancer treatment, patients may experi- 
ence different symptoms with different severity during and after 
treatment [4, 27, 28]. A commonly used way to monitor patients’ 
symptoms is to record symptom severity or occurrence through a 
questionnaire survey, which is commonly known as Patient-Reported 
Outcomes (PRO) data. Much research is done over these PRO data 
in order to identify symptoms in an early stage and guide treatment 
decisions, as well as investigating the relationships among these 
symptoms [19, 24]. In this work, we use data collected from head 
and neck cancer patients treated at the M.D. Anderson Cancer Center 
using the M.D. Anderson Symptom Inventory questionnaire [5] and 
more specifically, the Head-Neck Module (MDASI-HN) [22]. The 
module is comprised of 28 questions, 13 referring to core symptoms 
related to cancer (systemic), 9 to head and neck symptoms (local), 
and the remaining 6 to symptom-burden interference with daily ac- 
tivities (life general). Patients rated the severity of their symptoms 
on a scale of 0 to 10 with 0 being mild or no existence and 10 being 
very severe (the worst imaginable). During treatment, symptoms are 
experienced with greater severity than after treatment. Ideally, we 
would like to see that all symptoms have receded in the late stage 
(e.g. a year after treatment), but in some cases, symptoms persist 
affecting the Quality-of-Life (QoL) of the patients in the long term. 

Previous research on the MDASI-HN PRO data applies factor
analysis and cluster analysis to cluster symptoms and investigate
symptom progression [25]. These studies look at a particular snapshot
in time to cluster either the patients, by their experienced symptoms,
or the symptoms, given the patients’ ratings. In this work, we
approach the problem from a different perspective by modeling the
PRO data as a time series and applying the Long-Short Term Memory
(LSTM) Neural Network model [12] to predict late symptom
ratings 6 weeks and 12 months after treatment. Since PRO data is
self-reported, patients may skip questions or entire questionnaires
altogether, resulting in many missing values. To overcome this issue,
we apply several methods for missing data imputation and evaluate
the performance of the LSTM model for each method. We show that
the LSTM model is effective in predicting symptom ratings under the
RMSE and NRMSE metrics. In our experiments, the LSTM model
also outperforms other machine learning models. Furthermore, we
show that for this particular task, as has been observed in other
domains, the more data used, the better the model obtained, even
when the data needs to be imputed.

The main contributions of this paper can be summarized as fol- 
lows. This is the first work that looks at predicting late toxicity for 
head and neck cancer patients using LSTM. We evaluate different 
imputation methods for completing the data, including applying 
LSTM recursively. We compare the LSTM performance against 
ARIMA and other machine learning models. 


2 Related Work 

MDASI-HN PRO Data. PRO data has been widely collected, physically
and electronically, in clinical settings since it plays an important
role in evaluating treatment benefits [6]. The PRO data used
in this project comes from the MDASI-HN [22] questionnaire, in which
28 symptoms are rated on a scale from 0 to 10, with 0 being mild or no
symptoms and 10 being very severe, during and after treatment. As
shown in Table 1, the 28 symptoms can be divided into three types of
toxicity. All patients are asked to fill in MDASI-HN surveys before the
start of treatment (baseline) and then weekly for the 6 weeks that
the treatment lasts. Patients are also asked to fill in the survey at
their follow-up visits 6 weeks, 6 months, and 12 months after treatment.

Using the MDASI-HN PRO data, several studies focus on identifying
symptom clusters at a single time point [10, 14, 23]. Prior
research mainly used two methods to find symptom clusters: factor
analysis, such as principal component analysis, and cluster
analysis, such as hierarchical agglomerative clustering
[2, 7, 11, 25]. These studies focus on a single time point analysis,
whereas we model the PRO data as a time series.

Time Series Prediction and Imputation. Prediction of time series
is typically done by looking at the previous values in the series and
deciding the value at the current time step. Auto-regressive
Integrated Moving Average (ARIMA) [1] is a commonly used method
for the prediction of time series. The model combines the
Auto-regressive (AR) and Moving Average (MA) models, which are suitable
for univariate time series modeling. In the AR model, the output
depends on its own lags, while in the MA model, the output depends
only on the lagged forecast errors. More recently, Long short term
memory (LSTM) Recurrent Neural Networks have gained
popularity in time series prediction [9] and in the healthcare domain
[15]. Specifically, LSTM networks were used to mimic the pathologist’s
decision and in other diagnostic applications [18, 30], and
to recognize sleep patterns in multivariate time-series
clinical measurements [16]. However, to the best of our knowledge,
LSTM has not been previously applied to PRO data.
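
For reference, the AR and MA components described above are usually written as follows. These are the standard textbook forms, not equations taken from this paper; the symbols (constants c and μ, coefficients φ and θ, white-noise error ε, orders p and q) are notation introduced here for illustration.

```latex
% AR(p): the output depends on its own p previous values (lags)
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t

% MA(q): the output depends on the q previous forecast errors
y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

ARIMA combines both components and applies them to a differenced version of the series, which is what the "Integrated" part of the name refers to.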

Data imputation methods such as Multiple Imputation by Chained 
Equations (MICE) [3, 21], linear regression, Kalman filtering [13], 
among others, can be used to impute time series data. 


3 Proposed Approach 

In this section, we first describe the methodological approach 
including data pre-processing and the methods used for data imputa- 
tion. 


3.1 Long Short Term Memory (LSTM) 


Since PRO data, with patients self-reporting the severity of their
symptoms, is collected over time, we model it as a time series. If
we can learn from the patients’ answers over a period of time and
predict their answers next week or in 6 months, we could proactively
make recommendations to minimize the symptom burden and
therefore improve the patients’ quality of life. For example, we can
record patients’ responses to different symptoms from week 0 to
week 5 during the treatment and predict what the ratings for those
symptoms would be in week 6. Such a prediction, if accurate, can be
useful to make patients aware of the risks and to prescribe exercises
or medication that can help patients cope with the symptoms, in order
to avoid having to adjust treatment and to improve the long-term
quality of life of the patients.

Toxicity     | Symptoms
Systemic     | fatigue, constipation, nausea, sleep, memory, appetite, drowsy, vomit, numb
Local        | pain, mucus, swallow, choke, voice, skin, taste, mucositis, teeth, shortness of breath (SOB), dry mouth
Life general | general activity, mood, work, relations, walking, enjoy, distress, sad

Table 1: The 28 symptoms in the MDASI-HN questionnaire grouped into three types of toxicities.

Long short term memory (LSTM) neural networks are a type
of recurrent neural network (RNN) proven effective in predicting
time series [12]. Unlike a traditional neural network, LSTMs have a
feedback structure that stores the memory of events that happened in
the past and uses it as a parameter in prediction. The basic structure
of an LSTM model takes 3 different pieces of information: the current
input data, the short-term memory (hidden state) from the previous
cell, and the long-term memory (cell state). The input, a 3-dimensional
data structure (number of samples, number of time steps, number of
parallel time series or features), is pushed through the LSTM gates,
which regulate the information to be kept or discarded,
i.e., selectively remove any irrelevant information. The LSTM model
is able to memorize the time-series pattern of each patient’s responses
and to predict late toxicity. Another advantage of the LSTM
is the diversity of its inputs and outputs: an LSTM can handle multiple
predictions simultaneously. In a many-to-one mode, the LSTM
would learn from many patients and predict one symptom. In a
many-to-many mode, the LSTM would learn from many patients
and predict all 28 symptoms for the test data. Taking advantage of
this, we are able to generate predictions for all 28 symptoms using
one trained LSTM in many-to-many mode.

To feed the data into the LSTM model, the original data was 
transformed into a 3-dimensional array where the first dimension 
corresponds to the patients, the second dimension to the time steps, 
and the third dimension to the symptoms. The number of patients 
corresponds to the number of samples in the training data. The 
number of time steps depends on what late toxicity we are evaluating. 
For the toxicity at 6 weeks after treatment, the number of time steps 
is 6, while for the toxicity at 12 months after treatment, the number 
of time steps is 11. The number of symptoms is always 28. 
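
The transformation into a 3-dimensional array can be sketched as follows. This is an illustration only: the input layout (`records`, a mapping from a patient identifier to a chronological list of surveys) and the function name are assumptions, and nested lists stand in for the array type an LSTM library would expect.

```python
def to_lstm_input(records, n_time_steps, n_symptoms=28):
    """Arrange per-patient ratings as [patients][time steps][symptoms].

    `records` is a hypothetical mapping: patient id -> list of surveys in
    chronological order, each survey being a list of `n_symptoms` ratings.
    Only the first `n_time_steps` surveys of each patient are kept.
    """
    array = []
    for patient_id in sorted(records):
        surveys = records[patient_id][:n_time_steps]
        if len(surveys) != n_time_steps or any(len(s) != n_symptoms for s in surveys):
            raise ValueError(f"incomplete series for patient {patient_id}")
        array.append([list(s) for s in surveys])
    return array
```

With `n_time_steps=6` the result has the shape used for the 6-week target (patients × 6 × 28); with `n_time_steps=11` it matches the 12-month target.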


3.2 Data Imputation 


Many of the symptom severity scores in the PRO data have NaN
values. Some patients do not have enough follow-up time to collect
later time points. Like many other machine learning models, LSTM
requires complete data. As a proof of concept, we first consider three
imputation methods: linear interpolation, Kalman interpolation [13],
and Multiple Imputation by Chained Equations (MICE) [3].

Both Linear and Kalman interpolations are univariate imputations.
The first non-NA value was replicated to the start of the time
series and the last non-NA value was replicated to the end of the time
series. Kalman smoothing requires at least three observations. When
the time series contained fewer than three observations, the spline method
was used to interpolate. Since the MDASI-HN ratings range from 0
to 10, imputed values below 0 were replaced
by 0, and values larger than 10 were replaced by 10. Before applying
the Kalman smoothing, the time series for the patient was scaled
to between 0 and 1, the imputation was applied, and then the data was scaled
back to the original 0-10 scale.
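
A minimal sketch of the linear case, written in plain Python rather than with the R packages used in the study. The function name `impute_linear` is an assumption; it replicates the first and last observed values to the series boundaries and clamps results to the 0-10 rating range. Clamping never triggers for purely linear interpolation between valid scores, but it is kept as a guard because smoothing or spline methods can overshoot.

```python
def impute_linear(series, lo=0.0, hi=10.0):
    """Fill None gaps by linear interpolation between the nearest observed
    neighbours; leading/trailing gaps copy the first/last observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(series)
    for i in range(len(out)):
        if out[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:          # gap at the start: replicate first value
            value = series[right]
        elif right is None:       # gap at the end: replicate last value
            value = series[left]
        else:                     # interior gap: interpolate linearly
            w = (i - left) / (right - left)
            value = series[left] + w * (series[right] - series[left])
        out[i] = min(hi, max(lo, value))  # keep within the rating scale
    return out

filled = impute_linear([None, 2.0, None, 6.0, None])
print(filled)  # [2.0, 2.0, 4.0, 6.0, 6.0]
```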

MICE [3] is a multivariate imputation, so the time series for all
the patients are used simultaneously. That is, through an iterative
series of predictive models, each specified variable in the data set
is imputed using the other variables in the data set. Various predictive
models can be used to impute values. In this work, we used
predictive mean matching (pmm): for each missing entry, the
method forms a small set of candidates from all complete cases whose
predicted values are close to the predicted value for the
missing entry, and selects a random candidate from the set to replace
the missing value. We ran 5 iterations of the imputation.
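
The candidate-selection step of predictive mean matching can be illustrated as below. This is a sketch under stated assumptions, not the mice implementation: the regression model that produces the predicted values is omitted, and the function name `pmm_draw`, the donor-set size `k`, and the toy donor list are all hypothetical.

```python
import random

def pmm_draw(pred_missing, donors, k=5, rng=random):
    """donors: list of (predicted, observed) pairs from complete cases.
    Returns the observed value of one randomly chosen donor among the k
    whose predictions are closest to pred_missing."""
    nearest = sorted(donors, key=lambda d: abs(d[0] - pred_missing))[:k]
    return rng.choice(nearest)[1]

# Hypothetical (predicted, observed) pairs for complete cases.
donors = [(1.0, 1), (2.1, 2), (2.9, 3), (4.2, 4), (7.5, 8), (9.0, 9)]
value = pmm_draw(3.0, donors, k=3, rng=random.Random(0))
```

Because the imputed value is always an observed score from a donor, it stays on the original rating scale without any clipping.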

We also apply the LSTM model recursively to predict intermedi- 
ate time points and then use the predicted data to train the next time 
step in the model. 
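
The data flow of the recursive scheme can be sketched as follows. The sketch only shows how each prediction is fed back as input for the next time step; retraining between steps is not shown, and the stand-in predictor `mean_of_last_two` is an assumption that replaces the trained LSTM for illustration.

```python
def recursive_impute(series, predict):
    """Fill None entries left to right; each missing value is predicted from
    the already-completed prefix, so earlier predictions feed later ones."""
    out = []
    for value in series:
        if value is None:
            if not out:
                raise ValueError("need at least one observed starting value")
            value = predict(out)
        out.append(value)
    return out

# Stand-in predictor (NOT an LSTM): mean of the last two completed values.
def mean_of_last_two(history):
    tail = history[-2:]
    return sum(tail) / len(tail)

completed = recursive_impute([2.0, 4.0, None, None, 6.0], mean_of_last_two)
print(completed)  # [2.0, 4.0, 3.0, 3.5, 6.0]
```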

All the data imputations described were done using only the PRO 
time-series data without considering any other clinical variables such 
as gender, age, cancer staging, or treatment. 

Each imputation method produces a different complete version of 
the dataset. The complete data was then transformed into the correct 
input size of the LSTM model, and an LSTM model was trained on 
each of the imputed datasets. The predictions for all 28 symptoms 
were then compared using the Root Mean Squared Error (RMSE) 
and Normalized RMSE (NRMSE) metrics as defined below. 


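\[
\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})} = \sqrt{E\big((\hat{\theta} - \theta)^2\big)} \qquad (1)
\]

where \(\theta\) is the vector of observed values of the variable being
predicted, \(\hat{\theta}\) is the vector of predicted values, and \(E\) is the expected value.

\[
\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}} \qquad (2)
\]

where \(y_{\max}\) and \(y_{\min}\) are the maximum and minimum of the actual
data.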
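
As a small worked illustration of the two metrics, the following sketch computes them for a toy vector, with the expectation taken as the sample mean. The function names and the toy values are assumptions chosen for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)

def nrmse(observed, predicted):
    """RMSE divided by the range (max - min) of the observed data."""
    return rmse(observed, predicted) / (max(observed) - min(observed))

obs = [0.0, 2.0, 4.0, 10.0]
pred = [1.0, 2.0, 3.0, 8.0]
print(round(rmse(obs, pred), 4))   # 1.2247
print(round(nrmse(obs, pred), 4))  # 0.1225
```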


4 Experimental Results 
4.1 Experimental Setup 


MDASI questionnaires were collected from 823 patients weekly
for 6 weeks during treatment and at three time points after treatment
(6 weeks, 6 months, and 12 months). The original data was split
into two series: from baseline to 6 weeks after treatment, and from
baseline to 12 months after treatment. We then applied the three
interpolation methods to the two versions of the data and generated 6
imputed data sets. The time point to be predicted was not imputed,
and patients with missing surveys for that time point were excluded
from the model. For testing, we only considered patients with complete
data, that is, patients who completed all the PRO surveys for
each time point. We ended up with a total of 651 and 483 patients
for predicting 6 weeks and 12 months after treatment,
respectively. All data imputation was done in R using the imputeTS
[20] and mice packages. For MICE, the method used was predictive
mean matching (pmm), and only one imputation was used with 5
iterations.

Figure 1: Average PRO score trajectories and 95% confidence intervals
weekly during treatment and at 6 weeks, 3-6, 12, and 18-24 months after
treatment for all 28 toxicities included in this study among all
patients. Abbreviations: sob = shortness of breath; numb = numbness;
wk = week; M = month

The LSTM models were built with the open-source PyTorch
framework. The input and output dimensions were set to 28. We
used Mean Squared Error (MSE) as the loss function and Stochastic
Gradient Descent (SGD) as the optimizer with a learning rate of
0.215. Using a grid search for parameter tuning, we set the number of
hidden layers to 1 and the number of hidden dimensions to 8. During
training, we used early stopping to prevent over-fitting. The
RMSE score is calculated by taking the square root of
the PyTorch MSE metric, and the NRMSE score is calculated by
dividing the RMSE by the range (max - min) of the actual data.
All the networks were run on an NVIDIA GeForce RTX 2070 GPU
with 8GB of memory.
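
The early stopping rule mentioned above can be sketched independently of the training framework. The text does not state the exact criterion, so the patience-based rule below, the function name, and the parameters `patience` and `min_delta` are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch, loss) of the best validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; later epochs are never evaluated
    return best_epoch, best_loss

# Hypothetical validation losses per epoch.
losses = [4.0, 3.0, 2.5, 2.6, 2.7, 2.8, 2.0]
print(early_stopping_epoch(losses))  # (2, 2.5)
```

In the toy run, training stops after three epochs without improvement, so the lower loss at the last epoch is never reached; this is the usual trade-off when choosing the patience value.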


4.2 PRO Data Summary 


The PRO data is summarized in Figure 1 using the average symptom
severity for the different time points. As can be seen, patients experience
severe symptoms during the treatment, and over time most
of the symptoms return to baseline. However, for some symptoms
and for some patients the toxicity persists even 12 months after
treatment.

Figure 2: LSTM training and validation performance for predicting
symptom severity 6 weeks after treatment using only
complete data (original).


4.3 Training LSTM using only complete data 


Figure 2 shows the training and validation performance for the
LSTM model trained for predicting symptom severity 6 weeks after
treatment when only complete data (without imputation) is used in
the analysis. As can be seen, the validation RMSE score is consistently higher
than the training RMSE score, which may be caused by
an insufficient number of samples in the training set. The same trend was observed
for the NRMSE metric as well.


4.4 Data Imputation Evaluation 


To evaluate the imputation methods, we compare the model performance
in terms of RMSE and NRMSE when the LSTM was trained
with complete data generated using the different imputation methods.
Figure 3 shows the RMSE performance for training and validation of
predicting symptom severity 6 weeks after treatment using complete
datasets imputed with Linear interpolation, Kalman interpolation,
MICE, and LSTM-recursive imputation methods. As can be seen,
the model performance is better than when only complete data is
used. Furthermore, the performance of the imputation
methods is similar, and the models do not start to overfit until after
600 epochs for Linear interpolation and almost 1500 for the recursive
LSTM. The final model uses the early stopping strategy to improve
model performance.

Figure 4 shows the validation RMSE metric for the final models
trained over the complete data (original) and the four different
imputed datasets. For the 6Wk prediction, linear imputation has
the lowest validation RMSE score (1.9371) among the non-recurrent
imputation methods, whereas the original complete data
has an RMSE of 2.1653. The LSTM-recursive imputation shows the best
overall performance for all metrics on late symptom predictions for
6Wk and 12Mo, with the lowest overall RMSE
of 1.9142 and 1.4231 for 6Wk and 12Mo, respectively. The LSTM-recursive
method also achieved the best overall performance under
the NRMSE metric (not shown). It is worth noting that, regardless
of which imputation method was used, the models trained
with imputed data, and therefore more data, performed better than
the models trained with only complete data.


4.5 LSTM performance on individual symptoms 


Next, we want to evaluate the LSTM performance on individual
symptoms. For these experiments, we used the LSTM-recursive
interpolation dataset, as it had the best overall RMSE score for all
28 symptoms combined.

Figure 3: LSTM RMSE performance for predicting symptom
severity 6 weeks after treatment using Linear, Kalman, MICE,
and LSTM imputed datasets.

Figure 5 shows the (a) RMSE and (b) NRMSE score for each
symptom predicted 6 weeks and 12 months after treatment. As can
be seen, symptoms like taste and dry-mouth have higher RMSE
scores for predictions at both time points, while other symptoms
like vomit and nausea have lower RMSE scores. However, when we
look at the NRMSE metrics for the same symptoms, we see that the
NRMSE scores are higher for some of those symptoms with lower
RMSE. The reason is that NRMSE takes into account the range of
the responses for the given time point. At 6 weeks after treatment,
many symptoms still have a larger severity range than at 12 months
after treatment. Most of the symptoms have a lower RMSE score
for 12 months after treatment than for 6 weeks after treatment. The
only exceptions are constipation and teeth, for which the RMSE
score shows a slight increase for 12 months when compared to the
6-week prediction.

Figure 4: LSTM performance in terms of RMSE metrics for the
validation set when the model is trained with only complete data
(original) vs. complete data imputed with linear interpolation,
Kalman interpolation, MICE, or LSTM methods.


4.6 Comparison with other ML Models 


For comparison with other machine learning models, we focus on three
prevalent symptoms: pain, taste, and general activity. We
limit the number of symptoms because the other models considered
cannot handle multiple predictions simultaneously. We compare
the LSTM performance with the performance of six other popular
models: support vector machine (SVM), K-nearest neighbour
(KNN), random forest (RF), Gaussian naive Bayes (GauNB), multi-layer
perceptron (MLP), and ARIMA.

Figure 6 shows the RMSE comparison for the prediction of pain, taste,
and general activity symptoms 6 weeks after treatment
for the 6 ML models and two LSTMs: the one trained with the linearly
interpolated data (LSTM_L) and the recursive LSTM (LSTM*2).
As can be seen, the LSTM model yields the lowest RMSE scores for
all three symptoms. For pain, LSTM has an RMSE of 1.7794, which
is over 20% lower than the MLP prediction, which has the second-lowest
RMSE score of 2.2646. LSTM also outperformed all other
models for the 12-month prediction of these symptoms, but results
are omitted for brevity. Interestingly, for taste and general activity,
the LSTM trained over the data imputed using linear interpolation
shows a slightly lower error than the LSTM trained over the recursively imputed data.
A plausible explanation is that interpolation uses the values before and
after a gap in the series to impute the missing values and, intuitively,
symptom severity follows a linear increase/decay. In contrast, LSTM
only uses past information to forecast symptoms’ severity. In the
future, it could be worth exploring alternatives that combine
both methods.
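
The "over 20% lower" figure for pain follows directly from the two RMSE values reported above; the short check below reproduces the arithmetic (the helper name `relative_reduction` is an assumption).

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# RMSE values for pain reported in the text: MLP 2.2646, LSTM 1.7794.
reduction = relative_reduction(2.2646, 1.7794)
print(f"{reduction:.1%}")  # 21.4%
```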


5 Conclusion 

In this work, we used the PRO data from the MDASI-HN module 
and applied the LSTM model to predict late toxicity from head and 
neck cancer treatment. An accurate prediction can help identify per- 
sonalized symptom risk profiles and proactively prescribe exercises 
or medication that can help patients cope with symptoms to avoid 
having to adjust treatment and improve the long-term quality of life 
of the patients. 

To deal with the missing data, we applied three interpolation meth- 
ods, linear, Kalman, and MICE. In addition, we also applied LSTM 
recursively to complete the data. We compared the performance of 
the LSTM model in terms of RMSE and NRMSE. The results show 
that using linear interpolation as the imputation method, though 
it is the simplest of the methods used, yielded better performance 
than Kalman and MICE imputations. While the LSTM imputation 
produces lower overall error measures than using linear interpola- 
tion, linear interpolation performed better than LSTM imputation 
for some individual symptoms. In all cases, the use of imputed data 
produced a better model than using only complete data. Furthermore, 
the LSTM model outperforms other machine learning models in the 
prediction of individual symptoms including pain, taste, and general 
activity symptoms. 

As future work, we would like to evaluate whether the inclusion 
of clinical data [8, 17, 26, 29] into the analysis would further im- 
prove the predictive power of the LSTM models. There are different 
ways in which these clinical variables, which are mostly categorical, 
can be leveraged into the LSTM model to further improve model 
performance and symptom prediction. 

Figure 5: Validation RMSE (a) and NRMSE (b) scores for the prediction of individual symptoms 6Wk and 12Mo after treatment.

Figure 6: 6-week prediction RMSE on pain, taste, and general activity symptoms comparison for several machine learning models.
Abbreviations: SVM = support vector machine, KNN = K-nearest neighbour, RF = random forest, GauNB = Gaussian naive Bayes,
MLP = multi-layer perceptron, ARIMA = Auto Regressive Integrated Moving Average, LSTM_L = LSTM on Linear Imputed data,
ABSTRACT 


The computer science community is paying more and more at- 
tention to data due to its crucial role in performing analysis and 
prediction. Researchers have proposed many data containers such 
as files, databases, data warehouses, cloud systems, and recently 
data lakes in the last decade. The latter enables holding data in 
its native format, making it suitable for performing massive data 
prediction, particularly for real-time application development. Al- 
though data lake is well adopted in the computer science industry, 
its acceptance by the research community is still in its infancy stage. 
This paper sheds light on existing works for performing analysis 
and predictions on data placed in data lakes. Our study reveals the 
necessary data management steps, which need to be followed in 
a decision process, and the requirements to be respected, namely 
curation, quality evaluation, privacy-preservation, and prediction. 
This study aims to categorize and analyze proposals related to each 
step mentioned above. 
1 INTRODUCTION 


The last few decades have seen the emergence of several technologies
related to data and their management. Consequently, many
structures were proposed to hold, regroup, and query data, such as
files, databases, data warehouses, and recently data lakes. According
to Fang [7], the data lake is a methodology created by internet
companies to handle the scale of web data and to perform new types.
It can be described as a massive data repository based on low-cost
technologies that improves the capture, refinement, archival, and
exploration of raw data within enterprise transformations. This
emergent technology is characterized by its ability to hold massive
data in very heterogeneous forms (i.e., unstructured, semi-structured,
and structured) at the same time [13]. This valuable
characteristic has steadily increased data lake usage,
especially for real-time applications, where
time constraints prevent unifying
data schemas before loading data into a repository.

Unfortunately, despite the bright side of the data lake, its adoption
faces difficulties for several reasons. First, ingesting data
from multiple sources raises many questions about the quality of the
ingested data. Data quality may be described as "fitness for use",
which consists of assessing the quality of the data according to its
context of use. Data may be of appropriate quality for one use but
not of sufficient quality for another [16]. In addition
to data and source quality, ensuring the privacy of data held in
the data lake is a concern raised by professionals working with sensitive data
(e.g., healthcare, governmental, social, etc.). Privacy can be defined
as “the claim of individuals, groups, or institutions to decide for
themselves when, how and to what extent information about them
is communicated to others” [11]. To overcome these challenges,
scientists have started to develop mechanisms to curate data, ensure
their quality, and preserve privacy. This work aims to categorize
the existing works and to identify the open issues regarding data
management in a data lake. To that end, we consider it necessary
to first carry out a comprehensive analysis of the mechanisms used to
manage data in a data lake and to perform research using curated data
while ensuring the quality of data and preserving their providers’
privacy.

For this purpose, we propose in this paper a classification scheme
of articles, obtained by applying the
systematic mapping method [15], which consists of the following
five inter-dependent steps: (i) definition of the research scope, (ii)
conducting the search to identify all papers, (iii) screening of documents
to select the relevant ones, (iv) definition of a classification scheme,
and (v) sorting papers according to the classification scheme. After
that, we discuss the obtained results and shed light on open issues
related to data management in data lakes.

This paper is organized as follows: Section 2 depicts a scenario 
of performing predictions using data stored in a data lake. Section 
3 presents the systematic mapping method and finally, Section 4 
proposes a discussion of the systematic mapping results and the 
open issues. 


2 APPLICATION SCENARIO 


To motivate the use of a data lake for data prediction, we consider
the scenario of disease prediction and recommendation of actions
to be taken to manage a crisis situation. The disease control and
prevention center wants to take preventive actions after identifying
a new unknown virus. The detected virus spreads quickly and could
be considered a high risk for many countries. However, disease
prediction and the recommendation of required actions need to be performed
in real-time or near real-time on multiple data sources, which can
sometimes be complex. Therefore, there is a need to store data in a
data container that can hold heterogeneous data sources and perform
predictions with very little delay. The disease prevention and
control center can use a crisis analysis and management system to
perform data analysis and generate predictions and health recommendations.
The data ingested into this system can be stored in
a data lake. Once a user’s query is received, the system is supposed
to perform a data curation and integration process to predict the
disease and then generate suitable recommendations. Such a system
should maintain a degree of intelligence and adaptability to fit
the dynamic change of the global context (i.e., emergency, crisis,
etc.). Furthermore, it should adapt to the user’s context, namely
the purpose expected by the user. Such a system also has to be
a multi-purpose system that may be used by multiple users for
different purposes.

We assume that the system can be made up of the following 
components: data ingestion, curation, integration, and prediction. 
These components represent the data management steps in a data 
lake. In what follows, we present the challenges related to each 
data management step using a systematic mapping process. 


3 SYSTEMATIC MAPPING 


To have a global view and analyze curation, quality evaluation, 
privacy-preserving, and prediction using data placed in the data 
lake, we performed a systematic mapping as defined in [15] and 
described hereafter. 


3.1 Step 1: Definition of Research Scope 


The first step consists of defining the research questions to answer. As
mentioned above, the aim of our study is (i) to categorize and quantify
research contributions to data curation in the data lake, (ii) to
categorize research contributions that considered data and source
quality assessment and privacy-preserving in the data lake, (iii) to
categorize the studies that made predictions based on data stored
in the data lake, and (iv) to discover open issues and limitations
in existing works. Thus, our study is guided by the questions
listed in Table 1.

Table 1: Research questions for the systematic mapping

Research Question | Aim
RQ1: What are the existing proposals for data curation in a data lake? | This question helps to identify the existing approaches for enrichment, cleaning, linking, and structuring of data to get a well-organized data lake.
RQ2: What are the proposals for data and source quality evaluation and privacy-preserving? | This question aims at discovering the existing approaches of data quality evaluation and privacy-preserving to guarantee the quality of predictions and the anonymity of data providers.
RQ3: How did the published papers deal with prediction and reasoning? | This question helps to identify the proposed prediction approaches using data sources placed in the data lake.


3.2 Step 2: Search Conducting 


Search conducting consists of collecting papers from relevant scientific
search engines/databases, namely ACM Digital Library¹, IEEE
Xplore², ScienceDirect³, and Springer⁴. We chose a set of
keywords to retrieve papers from these databases. We used the following
query, which is divided into three sub-queries:


("Data Lake") AND ( Curation OR Enrichment OR Cleaning) 
("Data Lake") AND (Quality OR Privacy AND Preserving) 
("Data Lake") AND (Prediction OR Reasoning) 


At the end of this step, we found 1880 publications. 
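
To make the boolean structure of the three sub-queries explicit, the sketch below applies them to free text. It is an illustration only: real search engines apply their own syntax and operator precedence, substring matching stands in for proper indexing, and the reading of the second sub-query as Quality OR (Privacy AND Preserving) is an assumption. All names are hypothetical.

```python
def matches(text, *groups):
    """True if, for every group, at least one alternative matches; an
    alternative is a tuple of terms that must all occur in the text."""
    t = text.lower()
    return all(any(all(term in t for term in alt) for alt in group)
               for group in groups)

LAKE = [("data lake",)]
SUB_QUERIES = {
    "curation": [LAKE, [("curation",), ("enrichment",), ("cleaning",)]],
    # Assumed precedence: Quality OR (Privacy AND Preserving).
    "quality_privacy": [LAKE, [("quality",), ("privacy", "preserving")]],
    "prediction": [LAKE, [("prediction",), ("reasoning",)]],
}

def classify(abstract):
    """Return the names of the sub-queries that the abstract satisfies."""
    return [name for name, groups in SUB_QUERIES.items()
            if matches(abstract, *groups)]

print(classify("Privacy-preserving prediction over a Data Lake"))
# ['quality_privacy', 'prediction']
```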


3.3 Step 3: Paper Screening 


Following search conducting, we apply paper screening by defining
a set of inclusion and exclusion criteria. We considered
only papers in English that address data lakes by proposing an
original work such as an approach, a method, a framework, etc. We
excluded tables of contents, summaries, abstracts, forewords, encyclopedias,
books, editorials, reports, studies, reviews, duplicates,
and informative papers. After applying the filtering process, only 98
papers⁵ were included, while 1782 were excluded.


3.4 Step 4: Keywording Using Abstracts 


This step consists of defining the classification scheme. First, we 
define the facets that combine the classification dimensions. After- 
ward, we consider the relevant frequent words in papers’ abstracts 
as dimensions. We define six facets for classifying challenges in 
data lake: 


• Application domains: it deals with the field in which the
  authors present their proposals, such as agriculture, biology,
  business, healthcare, industry, policing, environment, social,
  smart cities, and sports.
• Type of the proposal: it concerns the proposal’s type, such
  as approach, architecture, framework, method, model, service,
  technique, and tool.
• Curation: it reveals the proposed curation technique, such
  as data cleaning, enrichment, metadata, and schema.
• Privacy: it concerns the technique used for privacy-preserving,
  such as access control, anonymization, blockchain, encryption,
  machine learning, and policy.
• Quality: it deals with the quality dimensions considered in
  the proposal, such as availability, reliability, response time,
  reputation, cost, completeness, relevancy, trustworthiness,
  currency, accessibility, consistency, accuracy, uniqueness,
  believability, speed, provenance, and timeliness.
• Analysis & prediction: it mentions the technique used for
  data analysis and prediction, such as aggregation, deep learning,
  machine learning, matrix factorization, probabilistic
  model, and rough set.

Figure 1: Classification scheme facets and dimensions

¹ https://dl.acm.org/
² https://ieeexplore.ieee.org/Xplore/home.jsp
³ https://www.sciencedirect.com/
⁴ https://link.springer.com/
⁵ https://zenodo.org/record/4749337#.YJqT_bUzZPY


3.5 Step 5: Data Extraction and Mapping Process 


In this step, we perform the mapping process to provide answers
to our research questions. Thus, we combine the facets, and we
present the results in bubble charts⁶.
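
Combining two facets amounts to counting papers per pair of facet values; each count would be one bubble in the chart. The sketch below shows that cross-tabulation on hypothetical records (not the study's data); the record layout and the function names `cross_tab` and `share` are assumptions.

```python
from collections import Counter

def cross_tab(papers, facet_a, facet_b):
    """Count papers for every (facet_a value, facet_b value) pair; a paper
    tagged with several values contributes to each combination."""
    counts = Counter()
    for paper in papers:
        for a in paper.get(facet_a, []):
            for b in paper.get(facet_b, []):
                counts[(a, b)] += 1
    return counts

def share(counts, total):
    """Express each count as a percentage of `total`, rounded to 2 places."""
    return {key: round(100 * n / total, 2) for key, n in counts.items()}

papers = [  # hypothetical records, not the study's data
    {"type": ["approach"], "curation": ["enrichment", "schema"]},
    {"type": ["approach"], "curation": ["metadata"]},
    {"type": ["framework"], "curation": ["enrichment"]},
    {"type": ["tool"], "privacy": ["encryption"]},
]
bubbles = cross_tab(papers, "type", "curation")
percentages = share(bubbles, total=3)  # 3 papers carry a curation tag
```

Since a paper can address several dimensions at once, the percentages of a facet do not have to sum to 100.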

RQ1: What are the existing proposals for data curation in a
data lake?

By combining the contribution and curation facets, we observe that
most of the papers present approaches (67.80%). We noticed
that more than one-third of the contributions address
enrichment. The identified proposals also handle Schema
(30.51%) and Metadata (16.95%), while the least addressed dimension
is Data Cleaning (10.17%). Thus, data enrichment and schema
curation are the most addressed aspects.


⁶ https://zenodo.org/record/5014065#.YNIXFegzZPZ

RQ2: What are the proposals for data and source quality
evaluation and privacy-preserving?

We have combined the contribution facet with the quality and the
privacy facets. We found that the most addressed quality dimension
is completeness (21.21%), followed by provenance (12.12%). We think
this can be explained by the nature of the data stored in data lakes, which
are often incomplete and can lead to uncertainty. At the same time,
the papers paid less attention to the other quality dimensions. As
for privacy mechanisms, the most used are Blockchain (25%), Encryption
(20%), and Access Control (20%), while the least addressed
are Policy and Machine Learning (10%). We point out that quality
dimensions like completeness and provenance are attracting
researchers’ attention. As for privacy, blockchain is a promising
research field.

RQ3: How have the published papers treated prediction
and reasoning?

To answer this question, we have combined the prediction and the
domain facets. We noticed that machine learning is the most used
mechanism to analyze and predict, while there is no significant
difference between the other mechanisms. Most of the contributions
did not address a specific field, but healthcare (24.24%)
received more attention than other areas such as sports (3.03%) and
the social domain (3.03%). In summary, machine learning is the most
used technique for prediction. Regarding application fields, several
contributions address healthcare, while most of the
proposals do not target a specific application field.


4 DISCUSSION AND OPEN ISSUES 


Our systematic mapping shows that the proposals address the data 
placed in data lakes from different data management perspectives, 
such as curation, quality evaluation, privacy-preserving, and predic- 
tion. In this section, we present the open issues of each dimension. 


4.1 Curation 


After establishing the statistical study, we noticed that most of the
contributions address data curation issues in a data lake compared
to other dimensions such as quality assessment, privacy, and prediction.
We previously outlined that data and source curation is a
fundamental step in a data lake’s data management (see Section 2).
Since data sources are stored in the data lake in their native format,
they need to be preprocessed before they can be exploited in the following steps
(i.e., integration and prediction). We classify curation tasks into
three classes: (1) data curation, including data enrichment, such as
semantic enrichment and contextualization, and data
cleaning, which encompasses data deduplication and data repair; (2)
curation tasks related to metadata, such as metadata extraction and
metadata modeling; and (3) curation tasks related to schema, such as extraction,
matching, mapping, and evolution. We noticed that not all
curation approaches can be applied to all data types. For instance, some curation
approaches like schema extraction are restricted to semi-structured
and unstructured data. Regarding curation, we noticed that the use
of machine learning techniques is still limited, especially for data
cleaning and schema mapping. Thus, several curation tasks, such
as deduplication, spotting errors and violations, and repairing data,
are hard to automate. Indeed, most of the studied curation approaches
rely on rule-based techniques, semantic techniques, or the
incorporation of machine learning with one of the above techniques.
Consequently, we identified rule-based contributions ensuring curation
tasks like detecting violations, such as [9]. Otherwise, we
found that machine learning techniques are combined with other
methods such as crowdsourcing to ensure the curation task, as in
the work presented in [3]. The latter proposes a semi-automatic
approach composed of automatic curation via curation services and
manual annotations via crowdsourcing and experts’ annotations.
Each of the proposed curation services [4] represents a curation step
(e.g., linking a dataset with a knowledge base). These services rely on
different techniques like rules, dictionaries, ontologies, and machine
learning. Thus, the authors have ensured the automatic completion of
some curation tasks (i.e., via orchestrating curation services), but
they still need human intervention. The incorporation of machine
learning with other techniques can be explained by the subjective
aspect of some curation tasks like detecting errors, which cannot
be identified using a series of rules and require human intervention.
Thus, full autonomy in performing curation and the use of machine
learning and rules generalization are still open issues of curation.


4.2 Quality 


Our systematic mapping has also considered the quality aspect. 
Data/source quality evaluation can be performed in the integration 
step, mainly when discovering data, which consists of finding a 
subset S of relevant data sources among the organization's many 
sources, appropriate to a user-supplied discovery query [5]. Our 
study shows that the authors focused mainly on completeness and 
provenance dimensions due to data incompleteness and uncertainty. 
Thus, they try to assess or improve the quality of these dimensions, 
primarily. The data quality depends on conducted curation tasks. 
We state that the authors in [2] insisted on data completeness due 
to its importance in data analytics. Thus, they propose a data im- 
putation approach to repair erroneous data that enhances analysis 
outcomes quality. Accordingly, while performing erroneous data 
repair, this negatively impacts the accuracy dimension. Similarly, 
achieving missing data repair negatively influence completeness 
and accuracy dimensions. The quality evaluation is not enough 
by itself. Indeed, to perform the quality evaluation, a conceptual 
background is needed to identify which information must be of 
high quality and the quality dimensions to be used for assessment 
[14]. Moreover, we identified variability and diversity of subjective 
and objective quality dimensions required to evaluate data and its 
source [12, 14]. Usually, the quality dimensions are defined at de- 
sign time and imply a domain expert’s involvement to identify the 
required quality dimensions for a particular data and data source, 
which is time-consuming and error-prone. Therefore, we think that 
implementing autonomous systems helps to identify the necessary 
quality dimensions at run-time. Such systems can adapt to several 
factors, such as data source characteristics (e.g., data source format) 
and user needs. We assume that the system described in Section 
2 is used by several users’ roles. For example, we distinguish two 
users ensuring different missions in a pandemic situation. Thus, 
their requirements and the expected timely data quality differ. Ac- 
cordingly, we think that an adaptive system can optimize the data 
evaluation process in terms of execution time and alignment with 
user expectations.


4.3 Privacy


Privacy is of great importance not only to internet users but also to 
businesses, legal communities, and policymakers. Hence, managers
can seek to predict which privacy-enhancing initiatives to adopt in
order to gain a competitive advantage [1]. Therefore, it is important
to include privacy-preserving mechanisms in each decision process that
contains steps of collecting, accessing, and storing individuals’ data.
As for data lakes, they are designed to hold various types of data
coming from different sources. Data can be internal
or external to an organization and can also be sensitive. Data in
different fields like healthcare and biology are still kept in
data silos due to their sensitivity. Users are hence still afraid of
leaking their data, such as medical analyses and genetic information.
Thus, a privacy-preserving mechanism is needed to prevent
data leakage threats, especially for systems that carry individuals’
information like the one presented in Section 2. In the presented
scenario, individuals’ data are collected from various sources such
as hospitals, health institutes, government, smartphones, wearable 
health devices, and IoT. Thus, the collected data are mostly sensitive 
and must be protected from the above threats. Our study reveals 
that various techniques can be used to preserve privacy in data 
lakes and big data ecosystems, such as access control, anonymiza- 
tion, blockchain, encryption, machine learning, and the definition 
of policies. We think that, for systems similar to the presented
one (Section 2), access control mechanisms are a prerequisite to
grant permissions according to users’ roles. However, such systems
also require more sophisticated mechanisms to ensure anonymization
and data encryption. When considering massive data,
anonymization techniques may face a linking problem with original
datasets. Encryption, for its part, may require significant
computational resources when dealing with a massive
amount of data [8]. Thus, we think that these techniques should be
readjusted to fit massive data processing. According to our study,
the usage percentages of the privacy-preservation techniques are close
to one another, with blockchain being the most used. Since the latter
is an emerging technology, it has received more interest from the
scientific community. Thus, blockchain data is well suited to data
analysis as it is secure and valuable. Blockchain would be an alternative
to conventional privacy systems, such as access controls, that
provides more convenience and security for users to disclose their
personal information. However, according to our study, the use of
this mechanism is still limited in data lakes. We think that it can be
the subject of many contributions in the future.


4.4 Prediction 


The conducted study shows that data lakes offer a sophisticated
infrastructure to perform predictions. Many providers, such as
Microsoft Azure and Amazon AWS, offer a data lake architecture
as a service to visualize data, extract dashboards and perform
machine learning.

Our study reveals that most of the proposed contributions for
prediction are generic and not restricted to a specific field, although
the healthcare field has received more attention than other areas. This
can be explained by the nature of healthcare data, which can vary
in format (e.g., analyses, medical images, patient records).
Data lakes can be more efficient in developing real-time applications.
Contrary to data warehouses, data are stored in data lakes
using the Extract-Load-Transform (ELT) process instead of the
Extract-Transform-Load (ETL) process. Thus, data lakes apply
schema-on-read instead of schema-on-write. Postponement of the
transformation step can save time, making data lakes more suitable
for real-time predictions. Therefore, we think that it
is useful to use data lakes to achieve predictions and reach such
optimization degrees. Since the data lake contains a huge amount
of data, we also believe that it is necessary to improve the techniques
used for data integration and prediction to cope with this
volume. Furthermore, we think that the variety of data formats
and structures may influence the prediction process. As presented
in Section 2, data is ingested into the data lake from different sources
like health institutes, wearable health devices, social media, and
smartphones. Thus, the data lake holds very heterogeneous data
sources such as relational databases, images, text files, captured
signals, audio files, etc. These datasets are used during the prediction
process to generate useful outcomes for the users. To the best of
our knowledge, several predictors have proven their performance
with a specific data format. For example, CNN is a deep learning
architecture that is popular for image and video classification [6].
However, it may not be the most convenient predictor to use with
another data format. Besides, datasets may concern different topics,
which may influence the generated outcomes and their
interpretability. For example, dealing with users’ tweets differs from
dealing with chest X-ray images. Thus, we think that the generalization
of prediction and interpretability is a key challenge. On the other
hand, we notice that ensemble learning has been gaining more interest in
recent years [10] and can be valuable in such a context. Ensemble
methods like stacking allow the combination of multiple individual
predictors for the final prediction [17]. Hence, we think that stacking
is a promising method and is convenient for prediction using
data stored in data lakes.
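
The schema-on-read behaviour mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical raw zone that holds JSON lines exactly as they were ingested; the field names (patient, temp_c, source) and the function read_temperatures are illustrative assumptions and do not come from the surveyed works. No schema is enforced at load time; a schema is applied only when a prediction task reads the data.

```python
import json

# Hypothetical raw zone: heterogeneous records are stored as received (ELT).
raw_zone = [
    '{"patient": "p1", "temp_c": "38.2", "source": "wearable"}',
    '{"patient": "p2", "text": "feeling fine", "source": "social"}',
]


def read_temperatures(raw_records):
    """Apply a schema at read time: keep only records that carry a temperature."""
    readings = []
    for line in raw_records:
        record = json.loads(line)  # parsing is deferred until the data is read
        if "temp_c" in record:
            # The type conversion (the "transform" step) also happens at read time.
            readings.append((record["patient"], float(record["temp_c"])))
    return readings


print(read_temperatures(raw_zone))
```

Records that do not fit the schema requested by one task (here, the social media text) stay in the lake untouched and remain available to other tasks.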


5 CONCLUSION 


Since the data lake is still in its early stages, few reviews are pro- 
posed in the literature. These works have adequately defined the 
scope of the study. Nonetheless, they are based on a few papers 
and do not present a statistical analysis of the covered works nor 
a systematic structure. We strongly agree that the current state 
of the art in data lakes needs to be thoroughly examined to deter- 
mine statistics on what has been completed and what remains to 
be carried out. Our systematic mapping, the proposal of this paper, 
is a step in this direction. It broadens the breadth of the research 
by providing a deep analysis of the data lake and its related con- 
cepts for decision system implementation. Thus, we categorized 
and studied works related to data management in data lakes, which 
assisted us in identifying open issues for curation, quality evalua- 
tion, privacy-preserving, and prediction for data hosted in a data 
lake. Among the findings is that the use of machine learning and 
generalized rules for curation remains limited. To the best of our
knowledge, the current proposals lack autonomy in curating various
data sources. Similarly, when it comes to quality evaluation,
the current proposals cannot autonomously identify the quality
dimensions required to judge the "fitness for use" of such data. In
terms of prediction, we believe that the data lake is valuable,
particularly for real-time applications, due to its assets.
Furthermore, we found few contributions addressing privacy
preservation in data lakes. We also believe that the techniques
used for privacy preservation and prediction need to be improved to
deal with the massive amounts of data in data lakes. As a result, it
is critical to propose contributions in this direction. Following this
study, we hope that our findings will assist researchers working
on data lakes in identifying existing gaps in each of the dimensions
mentioned above.


ABSTRACT


Big Data is becoming a substantial part of the decision-making 
processes in both industry and academia, especially in areas where 
Big Data may have a profound impact on businesses and society. 
However, as more data is being processed, data quality is becoming
a genuine issue that negatively affects the credibility of the systems
we build because of the lack of visibility and transparency of the
underlying data. Therefore, Big Data quality measurement is
becoming increasingly necessary in assessing whether data can 
serve its purpose in a particular context (such as Big Data analytics, 
for example). This research addresses Big Data quality 
measurement modelling and automation by proposing a novel 
quality measurement framework for Big Data (MEGA) that 
objectively assesses the underlying quality characteristics of Big 
Data (also known as the V’s of Big Data) at each step of the Big 
Data Pipelines. Five of the Big Data V’s (Volume, Variety, 
Velocity, Veracity and Validity) are currently automated by the 
MEGA framework. In this paper, a new theoretically valid quality 
measurement model is proposed for an essential quality 
characteristic of Big Data, called Validity. The proposed 
measurement information model for Validity of Big Data is a 
hierarchy of 4 derived measures / indicators and 5 base measures.
Validity measurement is illustrated on a running example.


1 Introduction


Big Data refers to the vast amount of digital data stored in and
originating from different sources of digital and physical
environments. Big Data analysis has increased businesses’ ability
to gain a deeper understanding of their customers’ preferences and to
focus their resources on increasing profits [1]. A large variety of
industries benefit greatly from the use of Big Data, from the
healthcare industry to agriculture and farming ([2][3][4][5][6]). For
instance, being able to model and predict health assessment from
electronic health records is one of the many kinds of advancements
that can be achieved by leveraging Big Data in the healthcare space
([6][7]).


Big Data analysis and interpretation depend highly on the quality 
of underlying data as an eminent factor for its maturity. Data isn’t 
always perfect and building models of data that hasn’t been 
properly assessed can lead to costly mistakes [8]. Hence, models 
that are built off processed Big Data can only be as good as the 
quality of the underlying data. Therefore, as Big Data Pipelines 
begin to handle larger amounts of data, the need for modeling and 
monitoring data quality becomes indispensable for the stakeholders 
involved. This need is addressed in this research by proposing a
novel quality measurement framework for Big Data (MEGA) for
modeling and monitoring the quality characteristics of Big Data
(the V’s), where data issues can be identified and analyzed
continuously by integrating data quality measurement procedures
within the phases of Big Data Pipelines [9]. The goal is to flag data
quality issues before they propagate into the decision-making process [9].


The MEGA framework focuses on ten of the intrinsic Big Data quality
characteristics referred to as the 10 V’s, namely: Volume, Variety,
Velocity, Veracity, Vincularity, Validity, Value, Volatility,
Valence and Vitality. Measurement information models were
published for Volume, Velocity, Variety and Veracity ([10][11]).


In this paper we propose a new hierarchical measurement of the Big
Data quality characteristic referred to as Validity. The proposed
measurement model is built upon: i) the NIST (National Institute of
Standards and Technology) taxonomy towards the
standardization of big data technology [12], ii) the measurement
principles described in ISO/IEC/IEEE Std. 15939 [13], and iii) the
hierarchical measurement models discussed in [10] and [11]. The
newly proposed Validity measurements are validated theoretically
using the representational theory of measurement [14]. Note that
the remaining V’s (Vincularity, Value, Volatility,
Valence and Vitality) targeted by the MEGA framework are outside
the scope of this paper and will be tackled in our future work.


The rest of the paper is organized as follows: Section 2 summarizes 
the background knowledge needed to understand the rest of the 
paper. In Section 3, the proposed measurement model for Validity 
of Big Data is compared with the related work in the field and 
described in detail. The indicators of Validity’s characteristics 
Accuracy, Credibility and Compliance are explained in sections 4, 
5 and 6 (this includes describing the many base measures and 
derived measures needed to construct these indicators for Big 
Data). All measurements are illustrated on a running example and 
validated theoretically. The Validity indicator and its measurement 
hierarchy are depicted graphically in section 7. Section 8 concludes 
this paper and outlines our future research directions. 


2 Background: MEGA Framework 


Big data analysis and interpretation depend highly on the quality of
underlying data as an eminent factor for its maturity. In this
research we tackle the inherent quality characteristics of Big Data
known as the V’s of Big Data [9]. The MEGA framework
automates the V’s measurement information models aimed at
evaluating the quality aspects of Big Data. The MEGA approach
aims to solve problems that include the ability to: i) process both
structured and unstructured data, ii) track a variety of base
measures depending on the quality indicators defined for the V’s,
iii) flag datasets that pass a certain quality threshold, and iv) define
a general infrastructure for collecting, analyzing and reporting the
V’s measurement results for trustable and meaningful decision-making.


2.1 The MEGA Architecture 


The MEGA architecture focuses on having quality policies that are 
flexible enough to target a variety of Big Data projects. It allows 
data engineers and users to select V’s according to their needs and 
provides them with more flexibility, while data is being evaluated 
by the Quality layer. The MEGA solution allows for data to travel 
from the left to the right of the Big Data Pipeline unimpeded 
according to the scheduling protocol. This permits the Big Data 
Quality layer to measure attributes in parallel, while the pipeline 
continues processing data. The Big Data Quality layer permits the 
user to halt the pipeline process until quality validation is 
completed, by specifying constraints in the Quality Policy 
Manager. The goal is to give the user the greatest freedom
in adapting the Big Data quality assessment to their
specific needs at each phase of the pipeline.


The MEGA framework currently supports the 3V’s measurement
information model [10], the Veracity measurement information
model [11], and the newly proposed Validity measurement
information model described from section 3 onwards. For more details on
the proposed MEGA architecture and its comparison with the
related work ([16][17][18][19]) please refer to [9].


2.2 NIST Taxonomy of Big Data and ISO 15939


In this section we describe the foundation of the Validity
measurement model, which is built upon the NIST taxonomy [12]. The
taxonomy defines a hierarchy of roles/actors and activities that
includes data elements as the smallest level, records as groups of
elements, datasets (DS) as groups of records and finally multiple
datasets (MDS).


Data elements (DE) are considered as the smallest level. Each is
populated by its actual value, constrained by its data type
definition (e.g.: numeric, string, date) and chosen data format. A
data element can refer to a single token such as a word in the
context of unstructured text, for example.


Records (Rec) can be structured, semi-structured or
unstructured. Mobile and Web data (e.g.: online texts, images and
videos) are examples of unstructured data.


DataSets (DS) are considered as groups of records, and Multiple
DataSets (MDS) are groups of DS with the emphasis on the
integration and fusion of data. The model defines two types of
measures: base measures and derived measures [13].


A base measure is defined in ISO/IEC/IEEE Std. 15939 as
functionally independent of other measures.


A derived measure is defined as a measurement function of two 
or more values of base/derived measures. 
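

The entity hierarchy and the two kinds of measures can be sketched in a few lines. This is only an illustration under stated assumptions: the class names, record_count and mean_records_per_dataset are hypothetical and are not part of the MEGA framework or of either standard; they merely show a base measure (a plain count) and a derived measure (a function of two base measures).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    elements: List[str]  # data elements (DE), e.g. field values or tokens


@dataclass
class DataSet:
    records: List[Record] = field(default_factory=list)


@dataclass
class MultipleDataSets:
    datasets: List[DataSet] = field(default_factory=list)


def record_count(mds: MultipleDataSets) -> int:
    """Base measure (hypothetical): a count that depends on no other measure."""
    return sum(len(ds.records) for ds in mds.datasets)


def mean_records_per_dataset(mds: MultipleDataSets) -> float:
    """Derived measure (hypothetical): combines two base measures."""
    if not mds.datasets:
        raise ValueError("the MDS contains no datasets")
    return record_count(mds) / len(mds.datasets)


mds = MultipleDataSets([
    DataSet([Record(["Ann", "50000"]), Record(["Bob", "45000"])]),
    DataSet([Record(["Chris", "70000"])]),
])
print(record_count(mds), mean_records_per_dataset(mds))
```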


The novel Validity measurement information model proposed in 
this work defines how the relevant attributes are quantified and 
converted to indicators that provide a basis for decision-making. 
The Validity measures built hierarchically upon the model 
specified in section 2.2 are described in sections 3, 4, 5, 6 and 7. 


2.3 Representational Approach to Validation 


Validation is critical to the success of Big Data measurement. 
Measurement validation is “the act or process of ensuring that (a 
measure) reliably predicts or assesses a quality factor” [14]. The 
Validity measures are theoretically validated using the 
Representational Theory of measurement [14] with respect to 
Tracking and Consistency criteria introduced in [15]. 


Tracking Criterion assesses whether a measurement is capable of 
tracking changes in product or process quality over the life cycle of 
that product or process. 


Consistency Criterion assesses whether there is a consistency 
between the ranks of the characteristics of big data quality (V’s) 
and the ranks of the measurement values of the corresponding 
indicator for the same set. The change of ranks should be in the 
same direction in both quality characteristics and measurement 
values, that is, the order of preference of the Validity will be 
preserved in the measurement data. 


Tracking and consistency are a way to validate the representational
condition without collecting and analyzing large amounts of
measurement data, and the validation can thus be done manually.
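

The Consistency criterion can also be checked mechanically. The sketch below is one possible reading of the criterion, not the procedure used in [15]: it assumes the intuitive expectation is given as a rank per entity (a higher rank means better expected quality) and reports whether any pair of entities is ordered in the opposite direction by the measurement values. The function name is_consistent and the sample values are illustrative assumptions.

```python
from itertools import combinations


def is_consistent(expected_rank, measured):
    """Return True when no pair of entities is ordered in opposite directions.

    expected_rank: dict mapping entity -> intuitive rank (higher is better)
    measured:      dict mapping entity -> measurement value of the indicator
    Ties on either side are tolerated in this sketch.
    """
    for a, b in combinations(expected_rank, 2):
        rank_diff = expected_rank[a] - expected_rank[b]
        value_diff = measured[a] - measured[b]
        if rank_diff * value_diff < 0:  # the order of preference is reversed
            return False
    return True


expected = {"A": 1, "B": 2, "C": 3}
print(is_consistent(expected, {"A": 0.40, "B": 0.55, "C": 0.90}))  # order preserved
print(is_consistent(expected, {"A": 0.40, "B": 0.95, "C": 0.90}))  # B and C swapped
```

Applying the same check to measurement values taken at successive time frames gives a simple way to examine the Tracking criterion as well.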


2.4 Overview of the 3V’s and Veracity 


The MEGA framework automates the 3V’s measurement 
information model proposed to quantify three aspects of Big Data:
Volume, Velocity and Variety. Four levels of entities have been
considered, derived from the hierarchy of roles/actors and activities
of the underlying NIST (National Institute of Standards and Technology)
Big Data interoperability framework [12]. The 3V’s
measures were validated theoretically based on the representational 
theory of measurement. For more details, please refer to [10]. 
Veracity is one of the characteristics of big data that complements 
the 3V’s of Big Data and refers to availability, accuracy, credibility, 
correctness and currentness quality characteristics of data defined 
in ISO/IEC DIS 25024 [20]. A measurement information model for 
Veracity of big data was built upon [10] and published in [11]. 


3 Measurement Information Model for Validity 


Validity of Big Data is defined in terms of its accuracy and 
correctness for the purpose of usage [20]. However, few studies 
have been done on the evaluation of data validity. 


3.1 Comparison with Related Work 


Big Data validity is measured in [21] from the perspective of 
completeness, correctness, and compatibility. It is used to indicate 
whether data meets the user-defined condition or falls within a user- 
defined range. The model proposed in [21] for measuring Validity 
is based on medium logic. In contrast, in our work we consider a
3-fold root cause of Validity inspired by the ISO/IEC 25024
data quality characteristics accuracy, credibility and compliance: i)
the accuracy of data in MDS, ii) the credibility of DS in MDS, and
iii) the compliance of data elements in records, compliance of
records in DS, and compliance of DS in MDS.


3.2 Mapping of Validity to ISO/IEC DIS 25024


Big Data validity is measured in this paper from the perspective of 
accuracy, credibility, and compliance, which are adapted and 
redefined in order to provide an evaluation of the Big Data Validity 
with respect to defined information needs of its measurement 
model (see Figure 1).


Figure 1: Big Data Validity Mapping to ISO 25024


The measurement information model for the Validity of Big Data 
defines 3 indicators for measuring accuracy, credibility, and 
compliance, as described in sections 4, 5 and 6. 


4 Accuracy Indicator (Acc) 


Big data accuracy is essential for the Validity of Big Data. The
users of big data sets require the highest validity of their data, but
it’s a well-known fact that big data is never 100% accurate.


4.1 Notion of Accuracy


According to the dictionary definition, accuracy means “the quality
or state of being correct or precise”. ISO/IEC DIS 25024 defines
data accuracy as a degree to which “data has attributes that correctly 
represent the true value of the intended attribute of a concept or 
event in a specific context of use.” It also states that accuracy can 
be measured from the “inherent” point of view only. One of the 
ways to increase the accuracy is to match records and merge them 
if they relate to the common values of the data attributes. 
Consequently, we define the accuracy as a measure of the common 
information in DS relationships within MDS. 


4.2 Base Measures and Derived Measures 


To quantify objectively the common information within the 
multiple datasets (MDS), we use Emden’s information theory 
model [22]: we first abstract the MDS as an Attribute-Record table, 
where the rows represent all data elements in MDS, and the 
columns represent the records in MDS. The value of a cell of the 
resulting Attribute-Record table is set to ‘1’, if the data element is 
included in the record; otherwise, the value of the cell is ’0’. 


Base Measures H_acc and Hmax. We use the notion of entropy
to quantify objectively the common information [22]. The
measurement formula for calculating the entropy H_acc in the
Accuracy measurement model is as follows:


$H_{acc} = \log_2(Lbd) - \frac{1}{Lbd} \sum_{j=1}^{k} p_j \log_2(p_j)$ (1)


where Lbd is the count of the total number of records in the MDS,
k is the number of different column configurations in the Attribute-
Record table corresponding to MDS, and p_j is the number of
columns with the same configuration so that,


$Lbd = \sum_{j=1}^{k} p_j$ (2)


The value of H_acc varies according to the diversity of the column
configurations: common (repeated) configurations in the Attribute-
Record table (representing duplicated records with the same values
of the data attributes) will lower the entropy H_acc, while diversity
of the records will increase H_acc. H_acc = 0 when all records in
MDS contain all values of all data attributes, that is, k = 1 and p_1
= Lbd. The values of H_acc for a given DS vary between 0 and
Hmax calculated for a specific MDS, where Hmax represents the
maximum entropy for that MDS when all records are different and
thus there is no common information within MDS. Hmax =
log2(Lbd) when all records in MDS are distinct, corresponding to
the best-case scenario where there is no need to merge records:


$Lbd = k$ and $p_j = 1, \forall j \in [1..k]$ (3)

The unit of measurement is the information bit.


Derived Measure Acc. In order to measure accuracy
independently of the volume of the MDS, we propose to normalize
the entropy H_acc measure with Hmax. Hence, the measurement
function for the Accuracy indicator is:


Acc (MDS) = H_acc / Hmax (4)


Acc (MDS) normalizes entropy by the best-case scenario Hmax,
thus normalization will allow data users to objectively compare
different DS within the MDS in terms of their common
information. The Acc value is a number between 0 and 1, 0 meaning the
worst case (all data elements are common for all records), and 1
corresponding to the best-case scenario.
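

The base measures and the Acc indicator defined above can be sketched as follows. This is a minimal illustration, not the MEGA implementation: it assumes each record is represented by the frozenset of data elements it contains, which plays the role of its 0/1 column in the Attribute-Record table, so that two records share a column configuration exactly when their sets are equal. The function names h_acc and acc and the sample records are hypothetical.

```python
import math
from collections import Counter


def h_acc(records):
    """Entropy of formula (1) for a list of hashable column configurations."""
    lbd = len(records)                   # Lbd: total number of records
    counts = Counter(records).values()   # p_j: records sharing configuration j
    return math.log2(lbd) - sum(p * math.log2(p) for p in counts) / lbd


def acc(records):
    """Accuracy indicator of formula (4): H_acc normalized by Hmax = log2(Lbd)."""
    h_max = math.log2(len(records))
    if h_max == 0:
        # Assumption of this sketch: a single record leaves Acc undefined.
        raise ValueError("Acc is undefined for a single record (Hmax = 0)")
    return h_acc(records) / h_max


# Nine distinct records: no common information, so H_acc equals Hmax.
distinct = [frozenset({f"name{i}", "50,000"}) for i in range(9)]
# Four records, two of which are duplicates of each other.
with_duplicate = [
    frozenset({"Ann", "50,000"}),
    frozenset({"Ann", "50,000"}),
    frozenset({"Bob", "45,000"}),
    frozenset({"Chris", "70,000"}),
]
print(round(h_acc(distinct), 4), acc(distinct))
print(h_acc(with_duplicate), acc(with_duplicate))
```

Using Counter to group identical configurations avoids building the full Attribute-Record table, which matters when the MDS is large.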


4.3 Theoretical Validation of Accuracy Measures 


The Accuracy measures are assessed in this section with respect to
the Tracking and Consistency criteria introduced in section 2.3. To
illustrate the Acc indicator, we use an example of MDS at different
time frames T1 and T2. MDS_T1, shown in Figure 2, is a
representation of data at T1 that will be used as an example of
real-life Big Data. This will also be used to show how we can measure
data elements in Big Data. For the purposes of theoretical
validation, we present a modified case of MDS_T1 where new
records were added at time T2 (MDS_T2, see Figure 3), T2 > T1. In
both cases, there are no duplicated records in the multiple datasets
MDS_T1 or MDS_T2. All records were mapped to Attribute-Record
tables similarly to the method described in [22] and the entropy was
calculated as described in section 4.2.


Figure 2: Big Dataset Illustration of Accuracy at T1


Intuitively, we expect the value of Accuracy for both multiple
datasets MDS_T1 and MDS_T2 to correspond to the best-case scenario
of maximum accuracy, where there is no need to merge records.
From our intuitive understanding, we expect the entropy H_acc_T2
of the DS depicted in Figure 3 to be higher than H_acc_T1. We also
expect the value of the base measure Hmax for MDS_T2 to be
higher than the corresponding value for MDS_T1 due to the increased
size of the DS at time T2.


The values of the variables Lbd, k and p at time T1 are:


Lbd = 9, k = 9, and p_j = 1 ∀j ∈ [1..k]. The entropy value at time T1
for the DS depicted in Figure 2 is: H_acc_T1 = 3.1699
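

Substituting these T1 values into formula (1) shows where the number comes from. Since every column configuration occurs exactly once, each log2(p_j) term is zero and the entropy reduces to log2(Lbd):

```latex
H_{acc_{T1}} = \log_2(9) - \frac{1}{9} \sum_{j=1}^{9} 1 \cdot \log_2(1)
             = \log_2(9) \approx 3.1699
```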


At time T2 the values of the variables are: Lbd = 18, k = 14, p_i = 2
for i ∈ {1, 6, 12, 13} and p_i = 1 for the remaining column
configurations.


The entropy value at time T2 for the DS depicted in Figure 3 is:
H_acc_T2 = 4.1699. As expected, H_acc_T2 > H_acc_T1. Similarly,
Hmax_T2 = log2(18) > Hmax_T1 = log2(9).


As expected, the value of the Acc measure indicates the maximum
Accuracy result (Acc (MDS) = 100%) for both MDS_T1 and MDS_T2.


(Figure content: Dataset 1, Dataset 2 and Dataset 3, each with the columns Name, Salary and Debt.)


Figure 3: Big Dataset Illustration of Accuracy at T2


To validate the Tracking and Consistency criteria of the Accuracy
measures on DS with duplicated records, we modify the MDS
shown in Figure 2 (MDS_T1) and Figure 3 (MDS_T2) as depicted in
Figure 4 below:


(Figure content: datasets with the columns Name, Salary and Debt, containing duplicated records.)


Figure 4: Illustration of Accuracy Measurement with
duplicated records (T1 & T2)


Given that there are duplicated records in Figure 4 in the MDS at both
time T1 and time T2, intuitively we would expect the Accuracy of
MDS'_T2 to be higher than the Accuracy of MDS'_T1 due to the
relatively lower number of duplicated records. Our intuition would
also expect not only H_acc'_T2 > H_acc'_T1, but also H_acc_T1 >
H_acc'_T1 and H_acc_T2 > H_acc'_T2. The above intuitive
expectations are confirmed by the measurement results, where
H_acc'_T1 = 2.7254 and H_acc'_T2 = 3.7254.


The value of the Acc measure is calculated using formula (4):


Acc (MDS'_T1) = H_acc'_T1 / Hmax'_T1, where Hmax'_T1 = 3.17, thus
Acc (MDS'_T1) = 86.97%. Similarly, Acc (MDS'_T2) = 89.34%.


Based on the analysis of the above measurement results we can
conclude that both the Tracking and Consistency criteria hold for the
Accuracy measures, which establishes their theoretical validity.


4.4 Accuracy Indicator for MDS 


We propose to visualize the Accuracy indicator results by depicting
the Acc values of MDS graphically; this will allow data engineers
to easily trace the accuracy of individual DS and identify those
MDS whose records need to be analyzed further and merged, where
applicable. Figure 5 illustrates the Accuracy Profile graph for the
four MDS (MDS_T1, MDS_T2, MDS'_T1 and MDS'_T2).


Multiple datasets    Acc
MDS'_T2              89.34%
MDS'_T1              86.97%
MDS_T2               100%
MDS_T1               100%


Figure 5: Illustration of the Accuracy Profile Graph for MDS


Hence, the Acc indicator of Validity not only allows data users to
objectively compare different MDS in terms of their accuracy, but also
visualizes the Accuracy measurement results to facilitate the
decision-making of the Big Data users.


5 Credibility Indicator (Cre) 


5.1 Notion of Credibility 


The notion of credibility in the ISO/IEC DIS 25024 standard represents
“the degree to which data has attributes that are true and accepted
by users in a specific context of use” [20]. In order to measure
credibility, we assume the existence of up-to-date information on
qualified sources.


5.2 Base Measures and Derived Measures 


Base Measures Nds and Nds_cr. Let cre_source: DS → [0...1] be
a function that returns 1 if the source of a DS is qualified for use,
or 0 otherwise. Two base measures are defined in the measurement
model of credibility:


- Number of DS in Big Data set (Nds). Nds is a simple count
of the total number of DS in MDS.


- Number of credible DS in Big Data set (Nds_cr), where the
measurement method is counting the DS with qualified
sources:


$Nds\_cr(MDS) = \sum_{\forall DS \in MDS} cre_{source}(DS)$ (5)


Derived Measure Cre. We define the Credibility measure as the ratio
of the number of credible DS to the total number of DS. The measurement
function for Cre is specified as follows:


$Cre(MDS) = Nds\_cr(MDS) / Nds(MDS)$ (6)


Regular collection of Cre measurement data would allow
practitioners to gain timely and valuable control over the credibility of
the data sources in their MDS and eventually trace the MDS
credibility over time.
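

The two base measures and the Cre ratio translate directly into code. The sketch below assumes the qualification of each source is already known and supplied as a mapping from a DS name to a boolean; the function name cre and the dataset names are illustrative, not part of the MEGA framework.

```python
def cre(source_is_qualified):
    """Credibility of formula (6) for a mapping DS name -> bool (source qualified)."""
    nds = len(source_is_qualified)                              # Nds
    nds_cr = sum(1 for ok in source_is_qualified.values() if ok)  # Nds_cr, formula (5)
    if nds == 0:
        raise ValueError("the MDS contains no datasets")
    return nds_cr / nds


# Three datasets, two of which come from qualified sources.
print(cre({"DS1": True, "DS2": False, "DS3": True}))
# All sources qualified.
print(cre({"DS1": True, "DS2": True, "DS3": True}))
```

Recomputing the value at each time frame, with Nds held fixed, is enough to trace the credibility of an MDS over time.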


5.3 Illustration of the Cre Indicator 


We illustrate the Cre indicator through a simple extract of MDS at
two time frames. The first base measure we need to calculate is
Nds, which is defined as the number of DS present in MDS:
Nds = 3. Next, we assess the credibility of the DS sources, where
cre_source (DS) is set to 0 or 1, depending on whether or not
a specific DS is credible. In this example we assume that DS1 and
DS3 are credible: cre_source (DS1) = cre_source (DS3) = 1 and
cre_source (DS2) = 0. We also assume that the credibility of the DS
at times T1 and T2 remains the same. Next, we normalize the
credibility of MDS by Nds. The measurement value of Cre (MDS)
is 2/3 (or 66%), which indicates the proportion of credible DS.


(Figure content: three datasets with the columns Name, Salary and Debt, containing duplicated records.)


Figure 6: Illustration of Credibility with Duplicated Records
(Time T1 and T2)


5.4 Theoretical Validation of Credibility 


In this section we assess the Tracking and Consistency criteria of
the Cre measures. Intuitively, the more credible sources that exist
in MDS, the higher the value of credibility in MDS. In order to
validate the Cre measurement values against this intuitive
expectation, we fix the value of Nds through time and track the
changes of cre_source (DS) data. In the previous example (see
section 5.3) the value of cre_source (DS) doesn’t change from T1
to T2, and neither does Cre (MDS), as expected. If we, however, assume
that at time T2 the credibility of DS'_2 changes (cre_source (DS'_2)
= 1 at time T2), then we expect the credibility of MDS_T2 to
increase. The measurement value of Cre (MDS_T2) confirms that the
intuitive expectation is preserved by the measurement value of the
Credibility indicator, which increased from 0.66 to 1 (meaning
100% credible MDS).


These calculations establish the theoretical validity of the 
Credibility measures, as required by the representational theory of 
measurement. 


6 Compliance Indicators (rec_comp, DS_comp, MDS_comp)


In this section we introduce a hierarchy of measures for 
compliance, which evaluate compliance at the record (REC), 
dataset (DS) and multiple datasets (MDS) levels reflecting the 
corresponding entities in the NIST hierarchy. 


6.1 Notion of Compliance 


Compliance is defined as the degree to which data has attributes
that adhere to standards, conventions or regulations in force and
similar rules relating to data quality in a specific context of use,
according to the ISO/IEC 25024 definition. This means that whether or
not a data element is deemed compliant depends on the judgment
of the data scientists, the organization, standards and local laws and
regulations.


6.2 Base Measures and Derived Measures 


We define a function CompSource: Rec → {0, 1} that returns 1 if the
source record is compliant with the set of standards that have been
set by the researchers; otherwise, the returned value is 0.


Base Measure rec_comp. The base measure rec_comp counts the
number of compliant records in a DS, as defined below:


$$\mathrm{rec\_comp}(DS) = \sum_{\forall\, rec \in DS} \mathrm{CompSource}(rec) \qquad (7)$$


Derived Measures DS_Comp and MDS_Comp. The proposed 
derived measure for Compliance is defined as a ratio of the Big 
Data entities (records, DS, or MDS) that have values and/or format 
that conform to standards, conventions or regulations, divided by 
the total number of data entities. We propose two derived measures 
for the Compliance indicator measuring the above ratio at the level 
of a DS and the level of MDS as defined below: 


DataSet Compliance DS_comp. The measurement function for
Compliance along a DS is defined by DS_comp (DS) as follows:


$$\mathrm{DS\_comp}(DS) = \frac{\mathrm{rec\_comp}(DS)}{L_{dst}(DS)} \qquad (8)$$


where Ldst (DS) is the number of records in a specific DS.


Multiple DataSets Compliance MDS_Comp (MDS). Finally, we
define a measurement function for quantifying objectively
Compliance across all DS in MDS; as stated in Section 6.3, it is
taken to be the average compliance of all DS.


Regular collection of Compliance measurement data would be
necessary to flag data that does not comply with local laws and
regulations. For instance, in the healthcare industry, HIPAA in the
USA or PIPEDA in Canada require compliance with their privacy
and data security regulations by law, thus the measurement of
compliance for such sensitive data will become imperative.
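
The three levels of the hierarchy can be sketched as follows. This is a minimal illustration under stated assumptions: the compliance rule (Salary and Debt must be numerical, commas allowed) is borrowed from the illustration in Section 6.3, the MDS level uses the "average compliance of all DS" wording of that section, and the records and function names are hypothetical.

```python
import re

# Assumed rule: digits, optionally grouped by commas.
NUMERIC = re.compile(r"\d{1,3}(,\d{3})*|\d+")


def comp_source(rec):
    """Return 1 if Salary and Debt of the record are numerical, else 0."""
    fields = ("Salary", "Debt")
    return int(all(NUMERIC.fullmatch(str(rec.get(f, ""))) for f in fields))


def rec_comp(ds):
    """Base measure: number of compliant records in a DS."""
    return sum(comp_source(rec) for rec in ds)


def ds_comp(ds):
    """Derived measure: compliant records divided by all records of the DS."""
    return rec_comp(ds) / len(ds)  # assumes a non-empty DS


def mds_comp(mds):
    """Derived measure: average DS-level compliance over the MDS."""
    return sum(ds_comp(ds) for ds in mds) / len(mds)


# Hypothetical records, chosen so that rec_comp is 2, 1 and 3.
ds1 = [{"Salary": "50,000", "Debt": "10,000"},
       {"Salary": "n/a", "Debt": "400"},
       {"Salary": "55,000", "Debt": "3,000"}]
ds2 = [{"Salary": "unknown", "Debt": "90,000"},
       {"Salary": "40,000", "Debt": "100"},
       {"Salary": "", "Debt": "0"}]
ds3 = [{"Salary": "70,000", "Debt": "30,000"},
       {"Salary": "6,000", "Debt": "0"},
       {"Salary": "45,000", "Debt": "1,500"}]

print([rec_comp(ds) for ds in (ds1, ds2, ds3)], mds_comp([ds1, ds2, ds3]))
```

With equally sized datasets, averaging the DS-level ratios and dividing all compliant records by all records give the same value; the two would differ for datasets of unequal size.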


6.3 Illustration of the Compliance Indicators 


Figure 7 shows the same data as in Figure 6, where the data that is
non-compliant is highlighted. We assume that for this example, all
data elements in Salary or Debt columns must be numerical
(commas are allowed). In this example, rec_Comp (DS1) = 2,
rec_Comp (DS2) = 1 and rec_Comp (DS3) = 3 at time T1. The
values of the Compliance Indicator at the DS level at time T1 are
as follows: DS_Comp (DS1) = 0.66, DS_Comp (DS2) = 0.33 and
DS_Comp (DS3) = 1. The result of the Compliance at time T1 in
terms of MDS is defined to be the average compliance of all DS:
MDS_Comp (MDS_T1) = 0.66. We perform the same steps to
measure Compliance at record, DS and MDS levels at time T2.
Finally, MDS_Comp (MDS_T2) = 0.72. As expected, MDS_Comp
(MDS_T2) > MDS_Comp (MDS_T1) = 0.66.


[Table image: three datasets, each with the columns Name, Salary and Debt; non-compliant cells are highlighted]


Figure 7. Illustration of non-compliant data in MDS 


6.4 Theoretical Validation of the Compliance 
Measures 


Based on the meaning of Compliance, the more compliant data
the Big Data contains, the larger the Compliance indicator value. The
perception of 'more' should be preserved in the mathematics of the
measure: the more compliant records that exist in a particular
dataset, the higher the rate of compliance as defined earlier. We
validate Compliance theoretically by fixing the value of Nds
(MDS) to 3 and tracking the change in MDS_Comp (MDS) through
time. We see from the example above that, as the values of DS_Comp
(DS1) and DS_Comp (DS2) at the DS level increase from T1 to T2,
Compliance at the MDS level increases from 0.66 to 0.72, which
represents a 9% increase. This is to be expected: the compliance
of the records in the DS increases by 8.5%
(DS1) and 9% (DS2), respectively.


The above calculations establish the theoretical validity of the
Compliance measures by demonstrating both the Tracking and
Consistency criteria, as required by the representational theory of
measurement.


6.5 Hierarchy of the Validity Measures


The objective of this section is to define the Validity indicator Mval,
which allows different Big Data sets to be compared objectively in
terms of the indicators Accuracy, Credibility and Compliance, and
to present graphically this new hierarchical measurement model
tailored specifically to the Validity of Big Data.


6.6 Validity Indicator Mval 


The Validity indicator proposed in this research is defined as a vector
Mval = (Acc(MDS), Cre(MDS), DS_comp(DS), MDS_comp(MDS))


that reflects, respectively, the accuracy, credibility and
compliance of the underlying data at the level of DS or MDS.


7 The Measurement Hierarchy 


The measurement information model proposed in this work is a
hierarchical structure linking the goal of Big Data Validity to the
relevant entities and attributes of concern, such as the entropy of an
MDS, number of records, number of DS, etc. In our approach, the
Validity characteristics were decomposed through three layers as
depicted in Figure 8.


Figure 8: Hierarchical Measurement Model of Validity


The measurement information model defines how the relevant 
attributes are quantified and converted to indicators that provide a 
basis for decision-making. 


8 Conclusions and Future Work 


In this paper, we proposed a new theoretically valid measurement
information model to evaluate Validity of Big Data in the context
of the MEGA framework, applicable to a variety of existing Big
Data Pipelines [9]. Four levels of entities have been considered in
the definitions of the measures, as derived from the NIST
hierarchy: data element, record, DS, and MDS. The model elements
are compliant with ISO/IEC/IEEE Std. 15939 guidelines for their
definitions, where five base measures are first defined, assembled
into four derived measures, evolving into four indicators. The
model is suitable for Big Data in any form: structured,
unstructured or semi-structured. Theoretical validation of the
Validity measures has been demonstrated.


We illustrated the Validity measurement model by collecting
measurement data on small examples and showed how the Validity
indicators Accuracy, Credibility and Compliance can be used to
monitor data quality issues that may arise. The relevance of such a
model for the industry can be illustrated with three simple examples
of usage of these measures and indicators:


Accuracy (Acc) and its profile: Accuracy is useful to
objectively compare MDS in terms of their information
content, as well as to monitor variations of Big Data
Accuracy over time. A decrease of Accuracy might trigger
an investigation, as actions might be needed to merge duplicated
(common) information.


Credibility (Cre) and its trend: Credibility allows easy and objective
comparisons of MDS in terms of their credibility. A
Credibility trend showing a decrease might trigger
an investigation, as a source of data could be damaged or
unavailable.


Compliance (Comp) at DS and MDS levels, and the
corresponding trends: Compliance allows tracking an
important characteristic of Validity at the level of record, DS
and MDS; a decrease in the measurement results over time
might trigger an investigation of the potential legal issues with
the usage of the Big Data.


Our future research will enhance the theoretical findings presented
in this paper with empirical evidence through evaluation of these
measures with open-access data and industry data.


ABSTRACT


At the end of 2019, the World Health Organization (WHO) announced
that the Public Health Commission of Hubei Province, China, had
reported cases of severe pneumonia of unknown cause, characterized by
fever, malaise, dry cough, dyspnoea and respiratory failure, which
occurred in the urban area of Wuhan. A new coronavirus, SARS-CoV-2,
was identified as responsible for the lung infection, now
called COVID-19 (coronavirus disease 2019). Since then there has
been an exponential growth of infections and at the beginning of
March 2020 the WHO declared the epidemic a global emergency. An
early diagnosis of those carrying the virus becomes crucial to contain
the spread, morbidity and mortality of the pandemic. The definitive
diagnosis is made through specific tests, among which imaging tests
play an important role in the care path of the patient with suspected
or confirmed COVID-19. Patients with serious COVID-19 typically
experience viral pneumonia.

In this paper we propose using the Multiple Instance
Learning paradigm to classify pneumonia X-ray images, considering
three different classes: radiographs of healthy people, radiographs
of people with bacterial pneumonia and of people with viral
pneumonia. The proposed algorithms, which are very fast in practice,
appear promising, especially if we take into account that no
preprocessing technique has been used.


CCS CONCEPTS 


• Applied computing → Health informatics; • Computing methodologies → Machine learning.


1 INTRODUCTION


The world is coping with the COVID-19 pandemic. COVID-19 is
caused by a Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV-2)
and its common symptoms are fever, dry cough, fatigue,
shortness of breath, and loss of taste and smell. The first known
case of this novel Coronavirus disease was reported in Wuhan, China
in the last days of 2019 [15] and since then the virus has spread all
over the world. On March 11, 2020 the WHO declared the epidemic
a global emergency (pandemic). Lockdown measures and drastic
restrictions of movements and social life are affecting the lives of
billions of people. COVID-19 is the most significant global crisis
since the Second World War, but its repercussions are exceeding
those of a war. At the time of this writing, May 4, 2021, the total
confirmed COVID-19 deaths according to WHO are 3,195,624 and
this number is likely to be underestimated. The most effective way
to limit the spread of COVID-19 and the number of deaths is to
identify infected persons at an early stage of the disease, and many
different proposals have been investigated for the development of
automatic screening of COVID-19 from medical image analysis.

COVID-19 has interstitial pneumonia as the predominant clinical
manifestation. By interstitium we mean a particular
entity located between the alveolus and the capillaries, which
is investigated mainly with radiological techniques. Radiological
imaging does not represent a diagnostic criterion for Coronavirus
(SARS-CoV-2) infection, but it is able to highlight any pneumonia that
can be associated with it. In this case it is possible to see an opacity
on the radiograph, called thickening. The X-ray will show a greater
extension of the pulmonary thickening: in the X-ray the lungs will
appear, so to speak, more and more white.

Therefore, while huge challenges need to be faced, medical imaging
analysis arises as a key factor in distinguishing viral pneumonia
from bacterial pneumonia. For the assessment
of COVID-19, chest radiography (CXR) and computed tomography
(CT) are used. CT imaging shows high sensitivity, but X-ray
imaging is cheaper and easier to perform; in addition, (portable) X-ray
machines are much more widely available, also in poor and developing
countries.

In [18] a retrospective study is reported of 338 patients (62%
men; mean age 39 years) who presented to the emergency room in
New York with confirmed COVID-19. The authors divided each patient's
chest X-ray into six zones (three per lung) and generated a severity
score based on the presence or absence of opacity in each zone
(maximum score six, minimum score zero).

The idea underlying this work arises within the described scenario, 
characterized by an intense activity of scientific research, aimed at 
supporting fast solutions for diagnostics on COVID-19, which is a 
special case of viral pneumonia. Considering that there are recurring 
features that characterize the radiographs of patients affected by 
viral pneumonia, we propose a chest X-ray classification technique 
based on the Multiple Instance Learning (MIL) approach. We have 
considered a subset of images taken from the public Kaggle chest X- 
ray dataset [16] from which we have randomly extracted 50 images 
related to radiography of healthy people, 50 of people with bacterial 
pneumonia and 50 of people with viral pneumonia. This data set is 
widely used in the literature in connection with specific COVID-19 
data sets, as reported in [2] (see for example [1, 12-14, 19]). 

The preliminary results we present appear encouraging, especially 
if we consider that no preprocessing technique has been used on the 
X-ray images. 

The paper is organized in the following way. In Section 2 we 
describe the MIL paradigm and the related techniques used for our 
preliminary experiments, presented in Section 3. Finally, in Section 
4 some conclusions are drawn, accompanied by some considerations 
on possible future research directions. 


2 MULTIPLE INSTANCE LEARNING 


Multiple Instance Learning [11] is a classification technique consist- 
ing in the separation of point sets: such sets are called bags and the 
points inside the sets are called instances. The main difference of a 
MIL approach with respect to the classical supervised classification 
is that in the learning phase only the class labels of the bags are 
known, while the class labels of the instances remain unknown. 

The MIL paradigm finds applications in a lot of contexts such 
as in text classification, bankruptcy prediction, image classification, 
speaker identification and so on. 

A particular role is played by MIL in medical image and video
analysis, as shown in [17]. Diagnostics by means of image analysis
is an important field for supporting physicians in making early
diagnoses. We focus on binary MIL classification with two classes of
instances, on the basis of the so-called standard MIL assumption,
which considers positive a bag containing at least one positive instance
and negative a bag containing only negative instances. This assumption
fits diagnostics by images very well: a patient is
non-healthy (i.e. is positive) if his/her medical scan (bag) contains
at least one abnormal subregion and is healthy if all the subregions
forming his/her medical scan are normal.
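
The standard MIL assumption can be written as a one-line rule. The sketch below is only an illustration of that rule on hypothetical instance labels; during learning the instance labels are unknown, which is exactly what makes the problem hard.

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive (+1) if at least one
    of its instances is positive, and negative (-1) otherwise."""
    return 1 if any(y == 1 for y in instance_labels) else -1


# Hypothetical scans, each described by the labels of its subregions.
scan_with_one_abnormal_region = [-1, -1, 1, -1]
scan_with_normal_regions_only = [-1, -1, -1, -1]

print(bag_label(scan_with_one_abnormal_region))
print(bag_label(scan_with_normal_regions_only))
```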

In [7] a MIL approach has been used for melanoma detection on
some clinical data constituted by color dermoscopic images, with
the aim of discriminating between melanomas (positive images) and
common nevi (negative images). The obtained results encourage
investigating the possible use of MIL techniques also in viral pneumonia
detection by means of chest X-ray images, which is the scope of
this paper. In particular, using binary MIL classification techniques,
our aim is to discriminate between X-ray images of healthy patients
versus patients with bacterial pneumonia, healthy patients versus
patients with viral pneumonia and patients with bacterial pneumonia
versus patients with viral pneumonia. The results we present
are preliminary since no image preprocessing technique has been
adopted.

The MIL techniques we use in this work for chest X-ray
image classification fall into the instance-level approaches and
are the subject of the next subsections. Both of them are designed taking
into account the so-called standard MIL assumption, stating that a
bag is positive if it contains at least one positive instance and is negative
otherwise.


2.1 The MIL-RL algorithm 


MIL-RL algorithm [5] is an instance-level technique based on solv- 
ing, by Lagrangian relaxation [10], the Support Vector Machine 
(SVM) type model proposed by Andrews et al. in [3]. Such model, 
providing an SVM separating hyperplane of the type 

$$H(w, b) \triangleq \{x \in \mathbb{R}^n \mid w^T x + b = 0\}, \qquad (1)$$


is the following: 


$$
\begin{aligned}
\min_{y, w, b, \xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \sum_{j \in J_i^+} \xi_j + C \sum_{i=1}^{k} \sum_{j \in J_i^-} \xi_j \\
\text{s.t.} \quad & \xi_j \ge 1 - y_j (w^T x_j + b), \quad j \in J_i^+, \; i = 1, \dots, m, \\
& \xi_j \ge 1 + (w^T x_j + b), \quad j \in J_i^-, \; i = 1, \dots, k, \\
& \sum_{j \in J_i^+} \frac{y_j + 1}{2} \ge 1, \quad i = 1, \dots, m, \\
& y_j \in \{-1, +1\}, \quad j \in J_i^+, \; i = 1, \dots, m, \\
& \xi_j \ge 0, \quad j \in J_i^+, \; i = 1, \dots, m, \quad j \in J_i^-, \; i = 1, \dots, k,
\end{aligned}
\qquad (2)
$$

where:


• m is the number of positive bags;

• k is the number of negative bags;

• $x_j$ is the j-th instance belonging to a bag;

• $J_i^+$ is the index set corresponding to the instances of the i-th
positive bag;

• $J_i^-$ is the index set corresponding to the instances of the i-th
negative bag.

Variables b and w correspond respectively to the bias and normal
to the hyperplane, variable $\xi_j$ gives a measure of the misclassification
error of the instance $x_j$, while $y_j$ is the class label to be assigned to
the instances of the positive bags. The positive parameter C tunes
the weight between the maximization of the margin, obtained by
minimizing the Euclidean norm of w, and the minimization of the
misclassification errors of the instances. Finally, the constraints


$$\sum_{j \in J_i^+} \frac{y_j + 1}{2} \ge 1, \quad i = 1, \dots, m, \qquad (3)$$

impose that, for each positive bag, at least one instance should be
positive (i.e. with label equal to +1).
Note that, when m = k = 1 and $y_j = +1$ for any j, problem (2)
reduces to the classical SVM quadratic program.
The core of MIL-RL is to solve, at each iteration, the Lagrangian
relaxation (4) of problem (2), obtained by relaxing constraints (3),
where $\lambda_i \ge 0$ is the i-th Lagrangian multiplier associated with the i-th
constraint of the type (3). In [5] it has been shown that, considering
the Lagrangian dual of the primal problem (2), in correspondence
to the optimal solution there is no duality gap between the primal and
dual objective functions.
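
For orientation, the usual Lagrangian construction applied to a model of this kind would move the relaxed constraints into the objective, weighted by the multipliers. The expression below is a sketch under the assumption that constraints (3) have the form $\sum_{j \in J_i^+} (y_j + 1)/2 \ge 1$; it is derived from the standard construction and is not copied from [5]. All remaining constraints of problem (2) are kept unchanged.

```latex
\min_{y, w, b, \xi} \;
  \frac{1}{2}\|w\|^2
  + C \sum_{i=1}^{m} \sum_{j \in J_i^+} \xi_j
  + C \sum_{i=1}^{k} \sum_{j \in J_i^-} \xi_j
  + \sum_{i=1}^{m} \lambda_i \Bigl( 1 - \sum_{j \in J_i^+} \frac{y_j + 1}{2} \Bigr)
```

Each term $\lambda_i (\cdot)$ is positive exactly when the i-th positive bag has no instance labelled +1, so a violation of the standard MIL assumption is penalized rather than forbidden.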


2.2 The mi-SPSVM algorithm 


Algorithm mi-SPSVM has been introduced in [8] and it exploits the
good properties exhibited for supervised classification by the SVM
technique in terms of accuracy and by the PSVM (Proximal Support
Vector Machine) approach [9] in terms of efficiency. It computes a
separating hyperplane of the type (1) by solving, at each iteration,
a quadratic problem (5), detailed in [8], in the variables $(w, b, \xi)$,
for varying sets $J^+$ and $J^-$, which contain the indexes of the
instances currently considered positive and negative, respectively. At
the initialization step, $J^+$ contains the indexes of all the instances of
the positive bags, while $J^-$ contains the indexes of all the instances
of the negative bags. Once an optimal solution, say $(w^*, b^*, \xi^*)$, to
problem (5) has been computed, the two sets $J^+$ and $J^-$ are updated
in the following way:


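$$J^+ = J^+ \setminus \bar{J} \quad \text{and} \quad J^- = J^- \cup \bar{J},$$
where

$$\bar{J} \triangleq \{ j \in J^+ \setminus J^* \mid w^{*T} x_j + b^* \le -1 \},$$
with $J^* = \{ j_i^*, \; i = 1, \dots, m \mid w^{*T} x_{j_i^*} + b^* \le -1 \}$ and


$$j_i^* \triangleq \arg\max_{j \in J_i^+} \{ w^{*T} x_j + b^* \}.$$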

Some comments on the updating of the sets $J^+$ and $J^-$ are in order.
A particular role in the definition of the set $\bar{J}$ is played by the set $J^*$,
introduced for taking into account constraints (3). We recall that such
constraints impose the satisfaction of the standard MIL assumption,
stating that, for each positive bag, at least one instance must be
positive. At the current iteration, the set $J^*$ is the index set (subset of
$J^+$) corresponding to the instances closest, for each positive bag, to
the current hyperplane $H(w^*, b^*)$ and strictly lying in the negative
side with respect to it. If an index, say $j_i^* \in J^*$, corresponding to
one of such instances entered the set $\bar{J}$, all the instances of the i-th
positive bag would be considered negative by problem (5), favouring
the violation of the standard MIL assumption. This is the reason why
the indexes of $J^*$ are prevented from entering the set $J^-$: in this way,
for each positive bag, at least one index corresponding to one of its
instances is guaranteed to be inside $J^+$.
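
The update of the two index sets can be sketched in Python with NumPy. This is a minimal illustration, not the Matlab implementation of [8]: it covers only the index bookkeeping for a given hyperplane (solving the quadratic problem is left out), the function and argument names are hypothetical, and the threshold test is assumed to be non-strict (score at most -1).

```python
import numpy as np


def update_index_sets(X, positive_bags, J_plus, J_minus, w, b):
    """One update of J+ and J- for a given hyperplane (w, b).

    X             : (n_instances, n_features) array with all instances
    positive_bags : list of index lists, one list per positive bag
    J_plus        : set of indexes currently considered positive
    J_minus       : set of indexes currently considered negative
    """
    scores = X @ w + b

    # J*: for each positive bag, the instance with the largest score,
    # kept only when even that instance has a score of at most -1.
    J_star = set()
    for bag in positive_bags:
        j_best = max(bag, key=lambda j: scores[j])
        if scores[j_best] <= -1:
            J_star.add(j_best)

    # J-bar: indexes leaving J+ because their score is at most -1,
    # with the indexes of J* protected from leaving.
    J_bar = {j for j in J_plus - J_star if scores[j] <= -1}
    return J_plus - J_bar, J_minus | J_bar


# Hypothetical one-dimensional data: two positive bags, one negative instance.
X = np.array([[2.0], [-2.0], [-3.0], [-2.5], [-4.0]])
new_plus, new_minus = update_index_sets(
    X, [[0, 1], [2, 3]], {0, 1, 2, 3}, {4}, np.array([1.0]), 0.0)
print(sorted(new_plus), sorted(new_minus))
```

In the second bag every instance scores below -1, yet index 3 stays in the positive set, so each positive bag keeps at least one instance treated as positive.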


3 NUMERICAL RESULTS


Algorithms MIL-RL and mi-SPSVM have been preliminarily tested
on 150 chest X-ray images, randomly taken from the public dataset
[16] available at https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia:
50 images are of healthy people (Figure 1), 50
correspond to people with bacterial pneumonia (Figure 2) and 50
are of people with viral pneumonia (Figure 3).

We have used the same Matlab implementation of MIL-RL as in 
[7] and the same Matlab implementation of mi-SPSVM tested in [8]. 

As for the segmentation process, we have adopted a procedure
similar to the one used in [6]. In particular, we have reduced the
resolution of each image to 128 x 128 pixels and we have
grouped the pixels in appropriate square subregions (blobs). In this
way, each image is represented as a bag, while a blob corresponds to
an instance of the bag. For each instance (blob), we have considered
the following 10 features:


• the average and the variance of the grey-scale intensity of the
blob: 2 features;

• the differences between the average of the grey-scale intensity
of the blob and those of the adjacent blobs (upper, lower,
left, right): 4 features;

• the differences between the variance of the grey-scale intensity
of the blob and those of the adjacent blobs (upper,
lower, left, right): 4 features.
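
The feature extraction described above can be sketched with NumPy as follows. Two points are assumptions, since the text does not fix them: the blob size (32 x 32 pixels here, giving 16 blobs per image) and the treatment of blobs on the image border (a missing neighbour is replaced by the blob itself, so the corresponding difference is 0). The function name is hypothetical and the sketch is not the Matlab code used for the experiments.

```python
import numpy as np


def blob_features(image, blob=32):
    """Return an (n_blobs, 10) array: one row of features per blob.

    image : square 2-D array of grey-scale intensities (e.g. 128 x 128)
    blob  : side length of a square blob (assumed value)
    """
    n = image.shape[0] // blob
    # Arrange the image as an n x n grid of blob x blob tiles.
    tiles = image[:n * blob, :n * blob].reshape(n, blob, n, blob).swapaxes(1, 2)
    mean = tiles.mean(axis=(2, 3))
    var = tiles.var(axis=(2, 3))

    def neighbour_diffs(stat):
        # Edge padding makes a missing neighbour equal to the blob itself.
        p = np.pad(stat, 1, mode="edge")
        upper, lower = p[:-2, 1:-1], p[2:, 1:-1]
        left, right = p[1:-1, :-2], p[1:-1, 2:]
        return [stat - upper, stat - lower, stat - left, stat - right]

    features = [mean, var] + neighbour_diffs(mean) + neighbour_diffs(var)
    return np.stack([f.ravel() for f in features], axis=1)


# Hypothetical image: a bag of 16 instances with 10 features each.
image = np.arange(128 * 128, dtype=float).reshape(128, 128)
bag = blob_features(image)
print(bag.shape)
```

Each returned row is one instance of the bag, so an image becomes a small matrix that can be passed to an instance-level MIL classifier.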


Since our approach is of the binary type, we have performed the
following classifications of chest X-ray images:


• bacterial pneumonia (positive) images versus normal (negative)
images (Table 1);

• viral pneumonia (positive) images versus normal (negative)
images (Table 2);

• viral pneumonia (positive) images versus bacterial pneumonia
(negative) images (Table 3).


Figure 1: Examples of chest X-ray images of healthy people


Figure 2: Examples of chest X-ray images of people with bacterial pneumonia


Figure 3: Examples of chest X-ray images of people with viral pneumonia


In particular, in order to consider different sizes of the testing and
training sets, we have used three different validation protocols:
5-fold cross-validation (5-CV), 10-fold cross-validation (10-CV)
and Leave-One-Out validation. As for the computation of the optimal
value of the tuning parameter C characterizing the models (2) and (5), in
both cases we have adopted a bi-level approach of the type used
in [4] and in [7].

In Tables 1, 2 and 3 we report the average values provided by 
MIL-RL and mi-SPSVM in terms of correctness (accuracy), sensi- 
tivity, specificity and F-score, computed on the testing set. We also 
report the average CPU time spent by the classifier to determine the 
separation hyperplane. 
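
For reference, the four quality measures can be computed from the predicted and true bag labels as sketched below. The function name is hypothetical, and the F-score is assumed to be the usual F1 score (harmonic mean of precision and sensitivity), which the text does not state explicitly.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F-score for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)    # true positives
    tn = sum(t == -1 and p == -1 for t, p in pairs)  # true negatives
    fp = sum(t == -1 and p == 1 for t, p in pairs)   # false positives
    fn = sum(t == 1 and p == -1 for t, p in pairs)   # false negatives

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity:
        f_score = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_score = 0.0
    return accuracy, sensitivity, specificity, f_score


# Hypothetical testing fold with five bags.
print(binary_metrics([1, 1, 1, -1, -1], [1, 1, -1, -1, 1]))
```

Under k-fold cross-validation these values would be computed on each testing fold and then averaged.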

We observe that mi-SPSVM is clearly faster than MIL-RL and, in 
general, it classifies better, even if the accuracy results provided by 
the two codes appear comparable. In classifying bacterial pneumonia 
(Table 1) and viral pneumonia (Table 2) against normal X-ray chest 
images we obtain high values of accuracy (about 90%) and sensitivity 
(about 94%). We recall that the sensitivity (also called true positive 
rate) is a very important parameter in diagnostics since it measures 
the proportion of positive patients correctly identified. 

On the other hand, when we discriminate between the viral pneumonia
and the bacterial pneumonia images (Table 3), we obtain
lower values than those reported in Tables 1 and 2, as
expected since the two classes are very similar. Nevertheless, these
values appear reasonable, especially in terms of sensitivity (82%
provided by mi-SPSVM) and of F-score (75.93%), if we consider
that no preprocessing procedure has been applied to the images.


IDEAS 2021: the 25th anniversary 


4 CONCLUSIONS AND FUTURE WORK 


In this work we have presented some preliminary numerical results 
obtained from classification of viral pneumonia against bacterial 
pneumonia and normal X-ray chest images, by means of Multiple 
Instance Learning (MIL) algorithms. These results appear promis- 
ing, especially if we take into account that no preprocessing phase 
has been performed. Moreover our MIL techniques appear appeal- 
ing also in terms of computational efficiency, since the separation 
hyperplane is always obtained in less than one second. 

Future research could involve preprocessing the images appropriately
and considering additional features to be exploited in the
classification process, including also COVID-19 chest X-ray images.
In fact the ambitious goal is to create a framework that can quickly
support the diagnostics of COVID-19.


Table 2: Data set constituted by 50 normal X-ray chest images and 50 X-ray chest images with viral pneumonia: average testing
values


Table 3: Data set constituted by 50 X-ray chest images with bacterial pneumonia and 50 X-ray chest images with viral pneumonia:


ABSTRACT


In this work, we used the Web IndeX system, which converts words
into hyperlinks in arbitrary Web pages, to implement a system
for sharing annotations registered on keywords within a limited
group. This system allows group members to view all the written
annotations by simply mousing over the keyword when it appears
on a web page, facilitating information sharing and awareness in
collaborative research and work. In this study, we refer to conventional
systems, which register annotations on a specific page, as
sticky note type annotation sharing systems. In contrast to the
sticky note type, our system is positioned as a functional type, but
it may display annotations that are not necessary since it allows
browsing on any page. To address this point, we propose a usability
judgment method that shows only the most useful posts.


CCS CONCEPTS 


• Human-centered computing → Collaborative and social
computing systems and tools.


KEYWORDS 


Annotation, Web IndeX, Computer supported cooperative work


1 INTRODUCTION


In recent years, with the spread of the Internet, many information
sharing tools have been developed and are being widely used,
such as e-mail, social media, the cloud, etc. However, in order to
share each piece of information in a collaborative research or work
environment, it is necessary to save or transmit each piece of
information to a dedicated tool or space. This makes the information
sharing process very time-consuming. We thought that if a word
related to information that we wanted to share appeared on a Web page, we could
write a memo directly on that word and share it with everyone,
and that this would make collaborative work and research more
efficient. Therefore, in this work, we propose a system that shares
messages among a specific group by directly annotating the
words in the Web page.

Various tools exist for the purpose of sharing annotations. Many 
of these tools register annotations on specific Web pages and doc- 
uments or on words within these documents. In that case, it is 
difficult to share the knowledge about the word when you view 
the word on another page after registering the annotation for that 
word. Therefore, we thought that using a system that could register 
annotations on the words themselves and view them on any Web 
page would make collaborative work and research more efficient. 

In this work, we call systems that register annotations only on
a specific Web page or document, as is most commonly done in
current tools, sticky note type annotation sharing systems. In contrast,
we propose a functional type annotation sharing system
that can share annotations on words in any arbitrary page. The
proposed system implements a function that annotates a specific word
registered in a database and applies the function to words. The argument of
the applied function is an arbitrary Web page. On this basis, we
position the system as a functional type.

In order to implement a functional annotation sharing system,
we used the Web IndeX system [7, 9], which is developed by the Toyama
laboratory, Keio University. The Web IndeX (WIX) system generates
hyperlinks from keywords in Web documents to make it easier for
viewers of Web pages to access other Web pages related to words
appearing in the page being viewed. The WIX system uses a WIX
file that describes entries in XML format, each combining a keyword
and a URL.

Functional systems allow annotations to be browsed efficiently because
annotations added on a specific page can be viewed on any page.
However, there are times when unnecessary or unreliable
annotations are displayed when browsing a web page. To solve
this problem, we propose a method to rank the annotations and
display only the top few annotations when sharing.

The structure of this paper is as follows. First, Section 2 gives an
overview of the Web IndeX system. Section 3 describes the annotation
sharing system, and Section 4 describes the evaluation. Section
5 is about improving the system. Section 6 describes related work
on annotation systems and usability judgment methods. Section 7
presents conclusions and future work.


2 WEB INDEX SYSTEM 


2.1 Overview 


The Web IndeX system is a system that joins information resources
on the Web. In general, current web pages have a structure in which
a specific link destination is attached to a specific anchor text. On
the other hand, the WIX system uses a WIX file that describes
a set of entries, which are pairs of keywords and URLs, in XML
format. By referring to the WIX file, the WIX system converts keywords
appearing in the text of any Web page into hyperlinks to the
corresponding URL.


2.2 WIX File 


A WIX file is a file that describes combinations of keywords and 
URLs in XML format. A keyword that will serve as an anchor for 
the links is described in a keyword element, and the URL of the 
corresponding Web page is described as a target element. These 
two are combined and called an entry. In the header element, it is 
also possible to describe the outline of the file, metadata, comments 
of the creator, etc. WIX files contain entries that share something in 
common, such as "List of English Wikipedia entry words", "Official 
professional baseball sites", and "English-Japanese dictionary". An 
example of a WIX file is shown below.


<WIX>
  <header>
    <comment>example</comment>
    <description>Apr 2020, saeki</description>
    <language>en</language>
  </header>
  <body>
    <entry>
      <keyword>Messi</keyword>
      <target>https://www.fcbarcelona.com/en/football/first-team/players/4974/lionel-messi</target>
    </entry>
    <entry>
      <keyword>Nishikori</keyword>
      <target>https://www.atptour.com/en/players/kei-nishikori/n552/overview</target>
    </entry>
  </body>
</WIX>


2.3 Architecture 


Figure 1 shows the architecture of the Web IndeX system. The WIX 
system converts keywords received from the client into keyword 
hyperlinks on the server. 


Figure 1: Architecture of the WIX system


2.4 WIX library 


The WIX library stores all WIX files and manages information on a 
file-by-file basis. WIX file creators can upload WIX files through 
the WIX library, and the uploaded WIX files are decomposed into 
entry units and stored in the WIX DB. 
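The decomposition of an uploaded WIX file into entry units can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes the element names of the example WIX file (entry, keyword, target), and the function names and the (wid, eid) numbering scheme are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document in the layout of the example WIX file.
SAMPLE_WIX = """<WIX>
  <header>
    <comment>example</comment>
    <language>en</language>
  </header>
  <body>
    <entry>
      <keyword>Messi</keyword>
      <target>https://www.fcbarcelona.com/en/football/first-team/players/4974/lionel-messi</target>
    </entry>
    <entry>
      <keyword>Nishikori</keyword>
      <target>https://www.atptour.com/en/players/kei-nishikori/n552/overview</target>
    </entry>
  </body>
</WIX>"""


def parse_wix(xml_text):
    """Return the (keyword, target) pairs found in a WIX document."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        keyword = (entry.findtext("keyword") or "").strip()
        target = (entry.findtext("target") or "").strip()
        if keyword and target:  # skip incomplete entries
            entries.append((keyword, target))
    return entries


def to_rows(wid, entries):
    """Number the entries so each becomes one (wid, eid, keyword, target) row.

    Assumption: wid identifies the file and eid counts entries from 1.
    """
    return [(wid, eid, kw, url) for eid, (kw, url) in enumerate(entries, start=1)]


if __name__ == "__main__":
    for row in to_rows(1, parse_wix(SAMPLE_WIX)):
        print(row)
```

Each resulting row could then be inserted into a database table, one row per entry.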
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2.5 Find Index 


Find Index is an automaton that loads the entry information in the
WIX DB into memory so that attachment can be performed at high speed.
The WIX system constructs the automaton based on the Aho-Corasick
method [8] and uses it to perform lexical matching.
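To make the matching step concrete, the following is a minimal sketch of the Aho-Corasick method in plain Python. It is a textbook-style illustration under our own naming (build_automaton, find_matches), not the Find Index implementation itself.

```python
from collections import deque


def build_automaton(keywords):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # keywords recognized when a state is reached
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: failure links of shallow states are set first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches


if __name__ == "__main__":
    automaton = build_automaton(["Google", "Apple", "blockchain"])
    print(find_matches("Google's AI phone assistant Duplex", automaton))
```

The text is scanned once regardless of how many keywords are registered, which is why this kind of automaton suits a large entry set held in memory.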


2.6 Convert into Hyperlink 


The client side of the WIX system is implemented as a Chrome
extension. When this extension is installed in Chrome, a toolbar
like the one shown in Figure 2 is displayed at the bottom of the
browser. Clicking a WIX filename button in this toolbar converts
the words in the web page into the corresponding hyperlinks.
This combining operation is called attachment. Figure 3 and Figure
4 show a web page before and after attachment using the WIX
file of English Wikipedia. As the example page, we used a news
article [10] from BBC NEWS.
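The effect of attachment on a piece of text can be sketched as follows. This is a simplified illustration that works on plain text rather than a full HTML page, and it substitutes a regular expression for the Find Index automaton; the function name attach and the example URL are hypothetical.

```python
import html
import re


def attach(text, entries):
    """Wrap each known keyword in plain text with a link to its target URL.

    entries maps keyword -> URL. A regular expression stands in for the
    Find Index automaton here; this is an assumption made for brevity.
    """
    if not entries:
        return html.escape(text)
    # Longest keywords first, so a longer keyword wins over its own prefix.
    keywords = sorted(entries, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    parts, last = [], 0
    for m in pattern.finditer(text):
        parts.append(html.escape(text[last:m.start()]))
        url = html.escape(entries[m.group()], quote=True)
        parts.append('<a href="{}">{}</a>'.format(url, html.escape(m.group())))
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


if __name__ == "__main__":
    entries = {"Google": "https://example.org/google"}  # hypothetical entry
    print(attach("Google's AI phone assistant Duplex", entries))
```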


Figure 2: WIX toolbar


Figure 3: Before Attachment


Figure 4: After Attachment


Figure 5: Toolbar of Annotation Sharing System


Figure 6: Popup menu of Annotation Sharing System


Figure 7: The Whole Picture of Annotation Sharing System


3 PROPOSED ANNOTATION SHARING 
SYSTEM 


3.1 Overview 


The proposed annotation sharing system can be used by anyone
who installs it as a Chrome extension. When the system is installed,
a toolbar as shown in Fig. 5 is displayed at the bottom of the Chrome
browser, and a popup menu as shown in Fig. 6 can be opened by
clicking the button displayed on the right side of the address bar.

Fig. 7 shows the relationship between users and groups in this 
system and the whole picture of this system. Users can join groups 
and share annotations on words within the group. A group here is 
a community that can share the same annotation, and members of 
the group can register annotations and browse annotations in the 
group. Any user of this system can create new groups. Users can 
also send a notification of the annotation content to the registered 
Slack channel when the annotation is registered. 

Lexical matching of the text on a web page against the words
registered in annotations is performed by the Find Index used in the
WIX system. On the server side, the WIX system converts keywords
into hyperlinks at attachment time, whereas this system converts the
HTML into HTML in which an annotation balloon is displayed when the
user hovers over a keyword. Also, in the WIX system the Find Index
was constructed manually after a new WIX file was registered, whereas
this system assumes that the construction is performed every 5
minutes. With a 5-minute update interval, annotations registered in
the meantime cannot be reflected on the Web immediately. Therefore,
a function that immediately sends a notification of each annotation
registration to Slack was added.


Table 3: Join table


group_id | user_id
1        | 2
1        | 3
2        | 1
3        | 1


3.2 Data management 


In this section we describe how data about the users, user groups 
and annotations are stored on the WIX system server. Data is man- 
aged in a relational database. We describe the data schema used 
and illustrate it with examples. 


3.2.1. User and Group DB. User and group management uses the
user table, group table, and join table. The user table stores the user
ID (id) and the user name (name). The join table stores the group
ID (group_id) and the user ID (user_id). The group table
stores the group ID (id), group name (name), password for group
participation (pwd), and Slack link URL (slack). Tables 1, 2, and 3
show examples of these tables.
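The three tables can be sketched with SQLite as follows. This is an illustrative reconstruction, not the system's actual schema: column types and constraints are assumptions, and the numeric IDs assigned to the users and groups named in this section are inferred from the join table rather than taken from Tables 1 and 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "user", "group" and "join" are quoted because GROUP and JOIN are SQL keywords.
conn.executescript("""
CREATE TABLE "user"  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE "group" (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                      pwd TEXT, slack TEXT);
CREATE TABLE "join"  (group_id INTEGER REFERENCES "group"(id),
                      user_id  INTEGER REFERENCES "user"(id),
                      PRIMARY KEY (group_id, user_id));
""")

# Assumed IDs: chosen to be consistent with the join table rows.
conn.executemany('INSERT INTO "user" VALUES (?, ?)',
                 [(1, "John"), (2, "Mary"), (3, "Harry")])
conn.executemany('INSERT INTO "group" (id, name) VALUES (?, ?)',
                 [(1, "database"), (2, "network"), (3, "AI")])
conn.executemany('INSERT INTO "join" VALUES (?, ?)',
                 [(1, 2), (1, 3), (2, 1), (3, 1)])


def groups_of(connection, user_name):
    """Return the names of the groups a user has joined, ordered by group ID."""
    rows = connection.execute(
        'SELECT g.name FROM "group" AS g '
        'JOIN "join" AS j ON j.group_id = g.id '
        'JOIN "user" AS u ON u.id = j.user_id '
        'WHERE u.name = ? ORDER BY g.id', (user_name,))
    return [name for (name,) in rows]


if __name__ == "__main__":
    for person in ("John", "Mary", "Harry"):
        print(person, groups_of(conn, person))
```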

These examples show that Mary and Harry are participating in the
'database' group and John is participating in the 'network' group and
the 'AI' group. Users can join multiple groups.


3.2.2. Annotation DB. Annotations are managed using the annotation
table and the wix_file_entry table, which is also used in the
WIX system. The annotation table stores the group ID (wid), word
ID (eid), user ID (uid), annotation (annotation), and registered date
and time (added_time). Tables 4 and 5 show examples of these tables.

These examples show that 'Blockchain' has two annotations and
'Voice recognition' and 'Autonomous car' have one annotation each.
Multiple annotations can be registered for the same word.
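How the two tables combine to retrieve every annotation for a word can be sketched as follows. The column names follow the description above; the column types, the sample annotation texts, the timestamps, and the helper name annotations_for are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wix_file_entry (wid INTEGER, eid INTEGER, keyword TEXT,
                             target TEXT, PRIMARY KEY (wid, eid));
CREATE TABLE annotation (wid INTEGER, eid INTEGER, uid INTEGER,
                         annotation TEXT, added_time TEXT);
""")

# Keywords as in Table 5; the target column holds the marker 'annotation'.
conn.executemany("INSERT INTO wix_file_entry VALUES (?, ?, ?, ?)", [
    (1, 1, "Blockchain", "annotation"),
    (1, 2, "Voice recognition", "annotation"),
    (1, 3, "Autonomous car", "annotation"),
])
# Hypothetical annotation rows, for illustration only.
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 2, "first sample note", "2020-04-01 10:00:00"),
    (1, 1, 3, "second sample note", "2020-04-02 09:30:00"),
    (1, 2, 2, "third sample note", "2020-04-03 14:00:00"),
    (1, 3, 3, "fourth sample note", "2020-04-04 16:45:00"),
])


def annotations_for(connection, wid, keyword):
    """Return all annotation texts for a keyword in one group, oldest first."""
    rows = connection.execute(
        "SELECT a.annotation FROM annotation AS a "
        "JOIN wix_file_entry AS e ON e.wid = a.wid AND e.eid = a.eid "
        "WHERE e.wid = ? AND e.keyword = ? ORDER BY a.added_time",
        (wid, keyword))
    return [text for (text,) in rows]


if __name__ == "__main__":
    print(annotations_for(conn, 1, "Blockchain"))
```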


3.3 Applying Annotations 


As with the Web IndeX system, annotation sharing is performed
by clicking a button on the toolbar. Figure 8 shows an example of
a web page after sharing. All the words registered in the selected
group in the database turn green, and when the user hovers over a
word, the annotation content is displayed in a black balloon. The
example in Fig. 8 uses a BBC NEWS article [11].


Table 4: Annotation table


Table 5: wix_file_entry table


keyword           | target
Blockchain        | annotation
Voice recognition | annotation
Autonomous car    | annotation


Figure 8: Sharing Annotation


4 EVALUATION 


We compared the supported media and the annotation targets of the
proposed system with those of several similar tools.

In addition, we performed a numerical experiment and compared
how annotations would be shared under the assumption that this system
and a similar tool were used in the same environment.


4.1 Comparison with similar tools 


Diigo [4], A.nnotate [2], and Adobe [1] were used as similar tools for
comparison. Table 6 shows the result of the comparison in terms of
supported formats. It shows that the proposed system supports
fewer formats than A.nnotate.


Table 6: Comparison with similar tools in terms of supported formats.


System          | Supported Formats
Proposed System | HTML
Diigo           | HTML
A.nnotate       | PDF, Word, HTML
Adobe           | PDF

Furthermore, all of these tools can share annotations instantly.
Because the proposed system is a functional type, it can be applied
to arbitrary pages, a feature that the other tools, which are sticky
note types, do not have.


4.2 Numerical Simulation 


4.2.1 Setting Conditions. A numerical simulation to evaluate this 
functional system was performed under the following hypothetical 
conditions. We assume a sticky note type system that registers 
annotations for words on a specific Web page only, and use it as a 
comparison target system. 


- 100 participants
- There are n pages that can be viewed (n: 10, 50, 100, 500, 1000)
- View one page per person per day (select randomly)
- Annotation is registered for m words (m: 2, 5, 25)
- Use the average number of annotation types viewed by each
  participant on each day.
- Calculate results on days 10, 30, and the last day.


This experiment calculates the viewing rate from the average
number of annotation types viewed by each participant. The viewing
rate is defined here as the fraction of all annotation types that
were viewed under each condition.
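The conditions above can be turned into a small Monte Carlo sketch such as the following. Several details are our assumptions rather than the paper's stated setup: each sticky note annotation is placed on one distinct randomly chosen page, each annotated word in the functional type appears on about n/m pages including that page, and the random seed is fixed for repeatability. The function name simulate is hypothetical.

```python
import random


def simulate(n_pages, m_words, days, participants=100, seed=0):
    """Return (sticky_rate, functional_rate) after the given number of days."""
    if m_words > n_pages:
        raise ValueError("need at least one page per annotated word")
    rng = random.Random(seed)
    per_word = max(1, n_pages // m_words)  # assumed pages containing each word
    sticky, functional = [], []
    for home in rng.sample(range(n_pages), m_words):
        # Sticky note type: the annotation is visible only on its home page.
        sticky.append({home})
        # Functional type: visible on every page that contains the word.
        others = [p for p in range(n_pages) if p != home]
        functional.append({home, *rng.sample(others, per_word - 1)})
    sticky_seen = functional_seen = 0
    for _ in range(participants):
        # One randomly selected page per participant per day.
        viewed = {rng.randrange(n_pages) for _ in range(days)}
        sticky_seen += sum(1 for pages in sticky if pages & viewed)
        functional_seen += sum(1 for pages in functional if pages & viewed)
    total = participants * m_words
    return sticky_seen / total, functional_seen / total


if __name__ == "__main__":
    for n in (10, 50, 100, 500, 1000):
        print(n, simulate(n, m_words=2, days=10))
```

Because every functional page set contains the corresponding sticky page, the functional viewing rate in this sketch can never fall below the sticky one; how large the gap is depends on the assumed word distribution.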


4.2.2 Result. First, we compare by the number of pages. Fig. 9
shows the annotation viewing rates of the sticky note type system
and the functional type system on day 10 when m = 2, and Fig. 10
shows the result when m = 25. With 25 annotation types, it is not
possible to view all of them if there are fewer than 25 pages, so
the graph omits the case of n = 10.
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Figure 9: Result of m = 2, Day 10 
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Figure 10: Result of m = 25, Day 10 


From these two graphs, it can be seen that the viewing rate of
the functional type system is much higher than that of the sticky
note type system in both cases. In addition, the viewing rate of the
sticky note type system decreases greatly as the number of pages
increases, whereas that of the functional type system does not change
much. Therefore, it can be inferred that the more pages there are,
the more efficiently the functional type system shares annotations
compared with the sticky note type system.

Second, we compare by period. Fig. 11 shows the viewing rate
of both systems with m = 5 and n = 100. From this graph, it can be
seen that the viewing rate increases with the length of the period in
both systems. The increase in the viewing rate was large in the
sticky note type system, whereas the viewing rate in the functional
type system was already close to 100% on day 10, so its increase by
day 50 was very small.

This shows that, over a long period of use, annotations can be
viewed sufficiently even with a sticky note type system, but over a
short period of use, it is more difficult to find annotations with a
sticky note type system than with a functional type system.

Finally, we compare by the number of annotated words. Fig. 12 shows
the viewing rate of both systems on day 30 with n = 100.
From this graph, it can be seen that the viewing rate increases as the
number of words increases in the sticky note type system, whereas
the viewing rate decreases greatly as the number of words increases
in the functional type system. This is because, in a sticky note type
system, the number of pages that carry an annotation increases, while
in a functional type system, each word exists on only n / m pages.
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Figure 11: Result of m = 5, n = 100 
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Table 7: Comparison of sticky note type and functional type. 


Type             | Strength                         | Weakness
sticky note type | Precise communication of intent. | Fewer opportunities to browse.
functional type  | Efficient sharing.               | Posts that are not useful will be shown.
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Figure 12: Result of n = 100, Day 30 


From the above comparison, it was found that the viewing rate
of the functional type system exceeded that of the sticky note type
system in every case. This difference in viewing rate can be regarded
as the rate at which annotations are overlooked when the sticky
note type system is used. By using the functional type system, it is
possible to reduce the oversight of annotations given by members of
the group and by users themselves, and to utilize accumulated
annotations more effectively.


5 IMPROVING THE SYSTEM 


5.1 Comparison of sticky note type and 
functional type 


The annotation sharing system described in the previous section 
is a functional type that registers annotations to words on an arbi- 
trary page. Table 7 shows the results of comparing the annotation 
characteristics of the functional type and the conventional sticky 
note type annotation sharing system. 

Focusing on the disadvantage of functional types, that is, the dis- 
play of unhelpful postings, we conducted a preliminary experiment 
to see from what perspective the usefulness was judged. 


5.2 Preliminary experiments 


In the preliminary experiment, ten members of our laboratory 
actually submitted their annotations over a period of two weeks. 
In order to facilitate the comparison of annotations for the same 
word, they focused on two specified news articles as the target web 
pages for the experiment. 
From this experiment, we found the following.

- Annotations with abstract or emotional content do not make
  sense when viewed on other pages.
- Users want to know on which page an annotation was originally made.
- Many posts contain URLs to other pages.


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada


Here, "abstract or emotional content" refers to annotations that
do not refer to the target, such as "difficult" or "interesting". From
this experiment, we identified three possible problems in the
functional annotation sharing system. First, when the URL of another
page is included in an annotation, it is unclear whether that page is
reliable. Second, annotations that have been around for a long time
are not as useful as those that are more recent. Finally, if an
annotation is not related to the word to which it is attached, its
usefulness when browsing other pages is reduced.


5.3 Usability judgment method 


Based on the problems identified in the previous section, we evalu- 
ate the annotation submissions from multiple viewpoints and judge 
their usefulness. Then, we propose a method of displaying only the 
top few annotations ranked by the judgment results when they are 
shared. 

The method uses the following three criteria to judge usefulness.

- Time between submission and viewing.
- Relevance of annotations, added words, and the content of
  the page being viewed.
- Credibility of the page if the post contains URLs of other
  pages.


As the second criterion, we propose a method to determine the 
relevance of annotations and added words to the contents of the 
page being viewed. We first considered the relevance of annotations 
and added words, then annotations and browsed pages, but found 
that words and annotations are short and the accuracy of determin- 
ing relevance is low. Therefore, we decided to use the relevance 
of the content of the page at the time of annotation registration 
and the page at the time of browsing. To determine the relevance, 
we measure the proximity of the topics of each content. To obtain 
these topics as numerical values, we use LDA [5], which is a topic 
model method. The number of topics at this time is not fixed, and 
we plan to evaluate the difference in results depending on the num- 
ber of topics. In this work, we used LDA, which is an approach 
based on the frequency of occurrence of words, but we will also 
implement methods using Word2vec, which is an approach based 
on the distributed representation of words, and Doc2vec, which is 
an approach based on the distributed representation of sentences. 

In the case of annotations with abstract or emotional contents 
mentioned in the problems, we are also considering a method to add 
a summary of the web page where the annotation was posted at the 
time of sharing. In this case, we plan to extract words expressing 
emotions from the annotation contents using natural language 
processing. 


5.4 Embedding judgement method into the 
system 


The first step is to calculate the topic of the web page that was used 
when the annotation was submitted. The values of the topics are 
stored in the database along with the contents of the annotation. 
When sharing the annotations, we calculate the closeness between 
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the topic of the web page being viewed and the topic of each word 
appearing in the web page. 
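The ranking step can be sketched as follows. The paper does not fix a closeness measure, so cosine similarity between topic distributions is used here purely as an assumption; the topic vectors, the function names, and the cut-off k are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_annotations(page_topics, annotations, k=3):
    """Rank annotations by topic closeness to the viewed page; keep the top k.

    annotations is a list of (text, topics), where topics is the topic
    distribution stored for the page on which the annotation was registered.
    """
    scored = [(cosine(page_topics, topics), text) for text, topics in annotations]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    viewed_page = [0.8, 0.1, 0.1]          # hypothetical 3-topic distribution
    stored = [
        ("note a", [0.7, 0.2, 0.1]),
        ("note b", [0.1, 0.1, 0.8]),
        ("note c", [0.3, 0.6, 0.1]),
    ]
    print(top_annotations(viewed_page, stored, k=2))
```

Other distances between distributions could be swapped in for cosine without changing the surrounding ranking logic.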


6 RELATED WORK 


6.1 Annotation sharing system 


PAMS2.0 [13] developed by Su et al. is an annotation sharing sys- 
tem similar to this work. This system was developed mainly for 
education and aims to deepen understanding by writing and shar- 
ing annotations while reading the same teaching material within a 
limited group of students. In PAMS2.0, it is possible to chat individ- 
ually or discuss with the whole group. It is also possible to attach 
annotations to tags such as questions and answers, and write text 
directly on the page. 

Similarly, a system for educational purposes is MyNote [3] devel- 
oped by Yu-Chien Chen et al. This system is built into the e-learning 
management system (LMS). It can annotate learning objects in the 
LMS, and also annotate Web documents from the LMS. 

In contrast to the above systems, which are sticky note types that
add an annotation to a specific page, the system in this research is
a functional type that registers an annotation to the word itself on
an arbitrary page.


6.2 Usability judgement method 


In terms of evaluating the contents posted by an unspecified number 
of users, the studies related to this research include the detection 
of fake news in Twitter posts and the evaluation of reviews on 
e-commerce sites. 

Siva Charan Reddy Gangireddy et al’s research [6] uses graph- 
based unsupervised learning by assuming the behavior of users 
who send fake news. Qiang Zhang et al. [14] proposed a Bayesian 
deep learning model for fake news detection by considering the 
content and time series of posts and replies. Debanjan Paul et al. 
[12] proposed a review evaluation method using a dynamic convo- 
lutional neural network, focusing on the fact that the conventional 
review evaluation method for e-commerce sites uses other users’ 
ratings and cannot correctly evaluate reviews with few ratings from 
other users. They also used natural language processing to deal 
with the fact that different words are used when reviewing the 
same aspect. 


7 CONCLUSION AND FUTURE WORK 


In this paper, we implemented a functional annotation sharing system
to facilitate the sharing of information and understanding within a
limited group. By annotating the words themselves rather than
specific pages, the viewing rate of annotations increased, so that
annotations can be used more effectively.

However, since functional types may display unhelpful posts, we 
are developing a usefulness judgment method to solve this problem. 

Currently, we are only implementing the second criterion, which
is to judge the relevance of annotations, added words, and the
content of the page being viewed. Therefore, in order to improve
the accuracy of the usefulness judgment, it is our future task to
implement the other two criteria in the system. In addition, we
will also consider how to weight and rank the data calculated by
each criterion. For the first criterion, judging usefulness based on
the period between posting and browsing, we plan to implement
a system that obtains the date and time of posting and the date
and time of browsing, and calculates the difference. As for the
third criterion, the reliability of other pages, we will consider how
to determine this in the future.
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ABSTRACT 


Social media and electronic media hold a vast amount of user-generated
data, such as people's comments and reviews about different
products, diseases, government policies, etc. Sentimental analysis
is an emerging field in text mining in which people's feelings and
emotions are extracted using different techniques. COVID-19 has been
declared a pandemic and has affected people's lives all over the globe.
It has caused feelings of fear, anxiety, anger, and depression, along
with many other psychological issues. In this survey paper, the
sentimental analysis applications and methods used in COVID-19
research are briefly presented. A comparison of thirty primary
studies shows that Naive Bayes and SVM are the most widely used
sentimental analysis algorithms in COVID-19 research. The applications
of sentimental analysis during COVID include the analysis of people's
sentiments, especially those of students, as well as the analysis of
reopening sentiments, restaurant reviews, and vaccine sentiments.
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- Computing methodologies — Classification and regression 
trees; - Applied computing — Multi-criterion optimization 
and decision-making; - Information systems — Sentiment 
analysis; -Human-centered computing — Heat maps; « Social 
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1 INTRODUCTION 


Today, people commonly post reviews on the internet to express their
opinions about the services and things they use, or about any trending
topic [34]. With the advancement of technology and industrialization,
many industries have been set up that invite customers to buy their
products online and to write reviews highlighting their positive and
negative experiences with those products. The overwhelming volume of
review data on the internet has increased the chances of extracting
valuable information from it in the form of sentiments [15] [16].
Sometimes, people express multiple sentiment polarities in a single
text. It is useful for a business or organization to understand the
sentiments of its customers. Sentimental analysis is a natural
language processing (NLP) task that interprets the meaning of a text
[14] and classifies the sentiments toward different aspects of the
text as positive, negative, or neutral. Aspect-level sentimental
analysis is the fine-grained task in sentimental analysis and provides
a complete sentimental expression [34]. Consider, for instance, "The
chair is comfortable but its price is high". In this example,
"comfortable" and "high" represent positive and negative sentiments,
respectively. Advances in artificial intelligence techniques have
displaced the traditional methods of sentimental analysis.

COVID-19 is considered the biggest problem faced by the whole world
since World War II. The coronavirus was first detected in China in
December 2019 and has since spread across the whole world.
According to Johns Hopkins University, more than 130 million people
had been infected and 2,861,677 deaths had occurred due to COVID-19
by the first week of April 2021. The destabilization caused by COVID
has produced multiple emotions in people, such as fear, anger,
anxiety, depression, and even hostility [31].

The two ways to solve sentimental classification tasks are
traditional machine learning methods and deep learning methods.
The traditional methods usually use classifiers such as SVM (Support
Vector Machine) and Naive Bayes, while among deep learning methods
the Recurrent Neural Network (RNN) and the Convolutional Neural
Network (CNN) have been widely used in NLP tasks. Deep learning is
becoming popular in this regard because it performs feature
engineering and extracts meaningful features automatically [34]. The
literature shows that CNN does not perform well in capturing
sequential dependencies, and that RNN suffers from vanishing gradients
and is hard to parallelize because of its recurrent nature. This
indicates that the existing approaches to sentimental classification
have shortcomings such as low accuracy and recall and long training
times. When a sentence contains multiple aspects with inconsistent
sentiment polarity, the dependencies between words weaken as the
distance between them increases. In such cases, an attention
mechanism can focus on the most important information.

In this study, we surveyed thirty primary studies related to
sentimental analysis during the COVID-19 pandemic and identified the
techniques that have been applied to classify people's sentiments, as
well as the application areas of sentimental analysis in COVID
research. The objectives of this survey are to identify the data
sources and data volumes used for sentimental analysis during
COVID-19, and to identify the most commonly used approaches and the
applications of sentimental analysis during COVID. This study also
presents the future implications for research with respect to COVID.


2 METHODOLOGY: 


This study reviews thirty primary studies, as shown in Table 1.
In Table 1, benchmark data sets and well-known data sources are
listed in column 2 to help researchers and readers obtain similar
kinds of data. The volume of data used in each individual study is
given in column 3. Column 4 specifies the types of approaches or
techniques that have been widely used during COVID for sentimental
analysis and classification. During COVID, sentimental analysis was
performed in different application areas, which are listed in
column 5 of Table 1. This is the most important aspect of this
survey, as it can open new research directions or topics for
future researchers. The future trends and implications are
presented in column 6.


2.1 Data Sources during COVID-19 research: 


Sentimental analysis is treated as a sentimental classification
task. During the COVID-19 pandemic, people experienced different
emotions and expressed them on different social media platforms.
Social media platforms are a rich source of information and data for
understanding people's reactions and feelings during the disruption
caused by COVID. Table 1 shows that the biggest data source for
research during the pandemic was Twitter. The statistics show that
24 out of 30 studies use Twitter as a data source, while the other
sources of data are online media and forums, Weibo accounts, WeChat
accounts, Reddit, Yelp, RateMDs, HealthGrades, Vitals, and the Qingbo
Big Data Agency. The information that can be found on these popular
social media platforms is given in Table ??.

Twitter: Twitter is considered the most popular social media
platform, with almost 81.47 million registered users [1]. People
share messages, called "tweets", about public and global situations,
which has turned Twitter into a data hotspot for web-based media
conversation. About 500 million tweets are posted in a single day,
which amounts to roughly 200 billion tweets per year [4]. The tweets
are grouped by topic, such as political matters, personal opinion,
national economic issues, and the COVID-19 pandemic [9].

WeChat: WeChat is a Chinese multi-purpose social media and
messaging platform. It has one billion monthly active users, which
is why WeChat is regarded as the most popular social media platform.
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2.2 Approaches for COVID Sentimental 
Classification 


With the rise of big data, there is a need to develop efficient analytics
tools [10]. The sentimental classification approaches used in COVID-19
research can be divided into three types: machine learning based
approaches, lexicon based approaches, and hybrid approaches.


2.2.1. Machine Learning Approaches: The machine learning based
approaches use well-known ML algorithms for sentimental
classification (SC) during COVID. They fall into two categories,
i.e. supervised and unsupervised learning methods.

Supervised Learning Methods: In the supervised learning 
methods, the instances of the data are labelled already [30]. Various 
supervised learning methods have been used in literature for the 
sentimental classification in COVID-19 related research as seen in 
Table 1. 

Naive Bayes: Naive Bayes is a supervised learning algorithm
and has been used in [1] and [27]. It works on the principle
of Bayes' theorem, given in equation 1.


P(H|X) = P(X|H)P(H)/P(X) (1) 
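In equation 1, H can be read as the class (for example, a sentiment label) and X as the observed words. A minimal sketch of how this becomes a text classifier is given below; it is a generic multinomial Naive Bayes with add-one smoothing written for illustration, not the setup of any surveyed study, and the training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (text, label). Returns the counts used for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label H that maximizes P(X|H)P(H) for the words X in text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log P(H) + sum of log P(x_i | H); P(X) is identical for every
        # label, so it can be dropped when comparing labels.
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train([
        ("stay safe and hopeful", "positive"),
        ("grateful for vaccine progress", "positive"),
        ("afraid of lockdown", "negative"),
        ("lockdown makes me anxious and afraid", "negative"),
    ])
    print(predict("afraid and anxious", model))
```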


Support Vector Machine: SVM is a machine learning algorithm based on
statistical learning theory that maps the input features into a
high-dimensional feature space in order to find a separating
hyperplane. It is used by [1], [20], [25] and [32].

Decision Tree and Random Forest: A decision tree is a machine
learning algorithm that trains a model to predict class values
based on simple decision rules inferred from the training dataset.
Random Forest belongs to the decision tree family and works by
building trees from randomly chosen features and randomly chosen
instances. It has been used by [1], [12], [25] and [32].

Other supervised learning approaches used in SC for COVID 
research are KNN [1], Linear Regression [1], Logistic Regression 
[27], [32], LSTM [11], [25], RNN [19] and BERT model [4], [18] etc. 

Unsupervised Learning Methods: In unsupervised learning 
methods, the data is not labelled. The unsupervised methods have 
been used in SC related to COVID. K-means clustering has been 
used by [6] while Latent Dirichlet Allocation (LDA) method has 
been used in many studies i.e. [5], [9], [22], [29], [32], [33], [35]. 


2.2.2 Lexicon Based Approaches: Two types of opinion words are 
used to express the feelings i.e. positive opinion words and neg- 
ative opinion words which are used to express likes and dislikes 
respectively. Different approaches are used to collect the opinion 
words list. 
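The basic idea of scoring text against lists of opinion words can be sketched as follows. The two word lists are small hypothetical examples rather than a published lexicon, and the scoring rule (count of positive words minus count of negative words) is a deliberately simple assumption.

```python
# Hypothetical opinion word lists, for illustration only.
POSITIVE = {"good", "safe", "hope", "grateful", "recovered"}
NEGATIVE = {"fear", "anxious", "angry", "sad", "lonely"}


def lexicon_score(text):
    """Return (#positive opinion words) - (#negative opinion words)."""
    score = 0
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'")  # drop surrounding punctuation
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score


def classify(text):
    """Map the lexicon score to a positive, negative, or neutral label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("I feel anxious and lonely during lockdown"))
```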

Dictionary-based approach and Corpus-based approach:
Dictionary-based and corpus-based approaches have been used
in [3] to perform sentimental analysis during COVID-19.

Natural Language Processing: Natural language processing
(NLP) is used along with lexicon based methods in order to find the
semantic relationships in a sentence. It has been used in [2] for the
mental health analysis of students during COVID. In [19] and [23],
NLP has been used to analyze the sentiments and the characteristics
of COVID-19, respectively. NLP has also been used in [33] to examine
COVID-19-related discussions, concerns, and sentiments using tweets.
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2.3 Sentimental Analysis Applications during 
COVID-19 


COVID-19 has attracted researchers in the area of sentimental
classification because COVID has affected people's behaviours and
attitudes in many ways. Various topics have been studied under
sentimental classification during COVID-19.


2.3.1. Sentimental analysis on palliatives distribution during
COVID-19. It is the responsibility of any government to maintain the
sustainability of its country. During COVID-19, people needed help
to lessen their economic as well as psychological stress. For
this purpose, different governments released relief packages
and certain other bonuses. In developing countries, monitoring the
transparency of public funds is a challenge. Therefore, it is
necessary for the government to analyse people's reactions and
sentiments regarding the palliative distribution, as these indicate
the reach of the funds and their impact on people's circumstances
during COVID-19. In [1], Adamu et al. performed sentimental
classification on the Nigerian Government's COVID-19 palliatives
distribution.


2.3.2 Public sentiments and mental health analysis of students during
the lockdown. COVID-19 has brought people's lives to a halt because
it spreads through human-to-human interaction. To stop the spread
of the coronavirus, it is necessary to impose measures that restrict
movement and result in less human-to-human interaction. The one
measure adopted by almost all the states of the world is "lockdown",
which results in the closure of airspace, educational institutions,
workplaces, public transport, etc. These measures have caused
sadness, loneliness, anxiety, fear, and many other psychological
issues in people, especially in students. Some students were stuck
in their hostels, far away from their hometowns, and others were
worried about their exams and educational activities. Thus, the
lockdown has affected people's lives and given rise to many
psychological issues such as depression. During this period, people
have used social media to express their feelings and emotions. Social
media posts such as tweets can be analyzed to help researchers
understand the state of mind of citizens [5], [22], [3] and students
[2]. In [6], [19], [24], [25], [27], [23], [13], [29], [33], [9], [28],
sentimental analysis of people's behaviour and attitudes during
COVID-19 has been performed using Twitter data. In [32], the
sentimental analysis of tweets is carried out with respect to the age
of the social media users, and the authors found that the volume of
tweets during COVID is higher among young people.


2.3.3. COVID-19 reopening sentiments. With the paradigm shift due to
COVID-19, billions of people’s lives have been affected directly or
indirectly. COVID-19 has induced feelings of fear and anxiety as well
as an economic crisis, which together are challenges to reopening
after COVID-19 [26]. A long-term lockdown is not a solution but
rather a threat to the economy of any country. COVID-19 has affected
the lives of students as well as working people. Given this
situation, everyone is longing to go back to normal life and physical
activities [17]. Hence, in [17] and [26], the researchers tried to
analyse people’s sentiments towards reopening after the COVID-19
disaster.


IDEAS 2021: the 25th anniversary 


IDEAS 2021, July 14-16, 2021, Montreal, QC, Canada 


2.3.4. Analyzing online restaurant reviews. This era of e-commerce
has enabled customers to lead a satisfying, quality life. Online
reviews have helped customers in decision-making. Online reviews are
important for restaurants as well, because they are aligned with the
star rating, and a one-star increase can earn good revenue for the
restaurant. During COVID-19, many restaurants got negative reviews
for being a cause of COVID-19 spread, for not properly heating the
outdoor area, or for slow service. Therefore, it is important to
analyse customers’ sentiments in order to improve restaurant quality.
The researchers analyzed customers’ sentiments, which in turn helped
customers to get good-quality food and environment and helped
restaurant management to maintain high quality [12].


2.3.5. Vaccine sentiments and racial sentiments. In [18], researchers
used concept drift to classify the sentiments of people associated
with the COVID-19 vaccine. During COVID-19, a rise in prejudice and
discriminatory behaviour against Asian citizens has been seen. The
researchers in [20] try to describe variations in people’s attitudes
towards racism before and after COVID-19.
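
Concept drift, mentioned above for vaccine sentiments, refers to the relationship between text and sentiment label changing over time, so that a classifier trained on early data degrades on later data. The sketch below shows one simple way to notice such drift, by tracking accuracy over consecutive time windows; it is an assumption made for illustration and not necessarily the procedure used in [18]. The function names and the `drop` threshold are hypothetical.

```python
def windowed_accuracy(predictions, truths, window):
    """Return the accuracy of each consecutive, non-overlapping window.

    `predictions` and `truths` are equally long sequences of labels,
    ordered by time. A trailing partial window is ignored.
    """
    accuracies = []
    for start in range(0, len(truths) - window + 1, window):
        pred_slice = predictions[start:start + window]
        true_slice = truths[start:start + window]
        correct = sum(p == t for p, t in zip(pred_slice, true_slice))
        accuracies.append(correct / window)
    return accuracies


def drift_windows(accuracies, drop=0.2):
    """Return indices of windows whose accuracy falls more than `drop`
    below the accuracy of the first (baseline) window."""
    if not accuracies:
        return []
    baseline = accuracies[0]
    return [i for i, acc in enumerate(accuracies) if baseline - acc > drop]
```

A flagged window suggests that the model should be re-examined or retrained on more recent data; the threshold would have to be tuned for a real monitoring setup.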


2.4 Future implications of research about COVID


Various future implications are given below:

- Consider larger amounts of data and explore multiple data sources and platforms.
- Analyze sentiments expressed in multiple languages.
- Perform real-time monitoring and visualization.
- Consider socioeconomic factors and household information while analysing the sentiments of the people.
- Use deep learning approaches.
- Analyze sentiments across different age groups.
- Explore public trust and confidence in existing measures and policies, which are essential for people’s well-being.
- Use more precise location information to improve spatial analysis.
- Analyze more specific topics to help policy makers, governments and local communities during emergency conditions.


3 COMPARISON OF STUDIES 


In this study, thirty primary studies are compared, and the
comparison is presented in tabular form in Table 1.
Table 1: Sentimental Analysis Approaches and Applications During COVID-19 Pandemic

Ref | Data Source/Set | Volume of Data | Techniques | Application | Limitation / Future Direction
[1] | Twitter | 9803 tweets | KNN, RF, NB, SVM, DT, LR | Sentimental analysis on COVID-19 palliatives distribution | Large data, consider multiple languages, real-time classification
[2] | Twitter | 330,841 tweets | NLP, bar graph | Mental health analysis of students | N/A
[4] | Twitter | 3090 tweets | BERT model | Classifying the fake tweets | N/A
[5] | Twitter | 410,643 tweets | Scatter plot, line chart, LDA | Public sentiments during the lockdown | Focus on English language, understand the perceptions and contexts related to negative sentiment
[3] | Twitter | 6,468,526 tweets | Dictionary-based methodology, corpus-based methodology | Sentiment analysis during COVID-19 | N/A
[6] | Online survey | N/A | Clustering algorithm (k-means) | Understand adults’ thoughts and behaviors | N/A
[8] | Online media and forums, Weibo account, WeChat account | N/A | Correlation analysis | Construct a framework of COVID-19 from five dimensions, i.e. epidemic, medical, governmental, public, and media responses | N/A
[9] | Twitter | N=1,001,380 | Latent Dirichlet Allocation | Sentimental analysis, identify dominant topics during COVID | Population is not represented, real-time posting
[11] | Reddit | 563,079 comments | LSTM | Uncover issues related to COVID-19 from public opinions | Evaluate other social media using hybrid fuzzy deep-learning techniques
[12] | Yelp | 112,412 reviews | GBDT, RF, SWEM | Analyzing online restaurant reviews | Different review platforms and restaurant locations
[13] | Twitter | 20,325,929 tweets | CrystalFeel | Examine worldwide trends of fear, anger, sadness, and joy | Expanding the scope to include other media platforms
[7] | Twitter | 500,000 tweets | TextBlob | Determining polarity and subjectivity in COVID tweets | Explore other social media
[18] | Twitter | 57.5M English | BERT | Concept drift on vaccine sentiments | Concept drift in real-time social media monitoring project
[19] | Twitter | N/A | NLP, RNN | Analyze sentiments and manifestations | Visualization, clustering and classification
[20] | Twitter | 3,377,295 | SVM | Changes in racial sentiment | Examine longer-term temporal changes in racial attitudes
[21] | Twitter | 840,000 tweets | TextBlob, LDA | Attitude of Indian citizens while discussing the anxiety, stress, and trauma | Analyzing how perception changes for different biographies
[23] | Twitter | 57 454 tweets | NLP and text analysis | Analyse the characteristics of Polish Covid-19 | N/A
[24] | Twitter | 370 tweets | Subjectivity vs. polarity, WordCloud | Sentimental analysis for COVID | N/A
Sentimental Analysis Applications and Approaches during COVID-19: A Survey


4 CONCLUSION 


This survey reviewed thirty primary studies related to COVID-19
sentimental analysis and compared them with respect to data sources,
techniques and applications. This article contributes to the
sentimental analysis field by considering its applications in
real-world scenarios. It concludes that sentimental analysis during
COVID-19 is still an open field containing many interesting topics
that can be addressed using advanced machine learning and deep
learning methods. It also finds that Naive Bayes and SVM are widely
used algorithms for sentimental analysis during COVID-19.
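
To make the Naive Bayes approach named above concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. The class name, the whitespace tokenization, and the four training sentences are assumptions made up for illustration; the surveyed studies train on far larger tweet collections and typically use established library implementations.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        self.vocab = set()
        for text, lab in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[lab].update(words)
            self.vocab.update(words)
        return self

    def log_scores(self, text):
        """Return the unnormalized log-posterior of each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for lab, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[lab].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[lab][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[lab] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training data, for illustration only.
train_texts = [
    "lockdown makes me anxious and sad",
    "so much fear and stress today",
    "vaccine news gives me hope",
    "happy to see family again",
]
train_labels = ["negative", "negative", "positive", "positive"]
model = NaiveBayesText().fit(train_texts, train_labels)
```

With this toy data, `model.predict("fear and stress")` returns `"negative"` and `model.predict("hope for the vaccine")` returns `"positive"`. Working in log space avoids numerical underflow when many word probabilities are multiplied.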